Running Jobs


Q. How can I submit a job on the Gardar cluster?

Once you have your username and password, log in to Gardar and submit your job script with:

qsub <yourscriptfile>

For the various qsub options, see man qsub.
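As an illustration of what <yourscriptfile> might contain, here is a minimal sketch of a job script, assuming the standard Torque/PBS directives. The job name, the resource request and the command being run are placeholders, not Gardar recommendations.

```shell
#!/bin/bash
# Hypothetical job name; choose your own.
#PBS -N example_job
# Assumed resource request: 1 node, 8 cores, 1 hour of walltime.
#PBS -l nodes=1:ppn=8,walltime=01:00:00
# Merge standard error into standard output.
#PBS -j oe

# PBS sets PBS_O_WORKDIR to the directory qsub was run from;
# fall back to the current directory when run outside the queue.
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir" || exit 1

# Placeholder for the real work.
echo "Job running on $(uname -n) in $workdir"
```

Saved as, say, myjob.sh, this would be submitted with qsub myjob.sh.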

Q. How can I access a compute node from the login node?

First log in to Gardar, then run the following command to log in to node compute-1-1:

ssh compute-1-1

Q. How can I get more information about my job?

Use the checkjob command: checkjob job_id. You can also use qstat -f job_id.

Q. How can I kill all my jobs?
A. To get rid of all your jobs at once, you can use a small trick :)

qselect -u $USER | xargs qdel

qselect prints a list of job IDs matching the given criteria, whereas xargs reads that multi-line input and passes the items as arguments to the command you give it, running the command as many times as needed until the input list is consumed.

So, to delete all your running jobs, use the following command:

qselect -u $USER -s R | xargs qdel

To delete all your queued jobs, use the following command:

qselect -u $USER -s Q | xargs qdel

xargs is very handy when you want to do the same thing to a lot of items, so use it :)
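Before deleting anything, it can be worth previewing what xargs would run. The sketch below uses made-up job IDs in place of real qselect output (the fake_qselect function is purely illustrative) and puts echo in front of qdel, so the command is only printed and nothing is deleted.

```shell
# Stand-in for `qselect -u $USER`: prints two made-up job IDs, one per line.
fake_qselect() {
    printf '%s\n' 110142.gardar-adm 110144.gardar-adm
}

# Dry run: echo prints the qdel command instead of executing it.
# By default xargs packs as many IDs as fit into a single command line.
fake_qselect | xargs echo qdel

# With -n 1, xargs runs the command once per job ID instead.
fake_qselect | xargs -n 1 echo qdel
```

On the cluster, the same dry run would be qselect -u $USER | xargs echo qdel; dropping the echo performs the actual deletion.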

Q. How can I kill my job?
A. To terminate one of your jobs, use the qdel command: qdel job_id

Q. How can I run a job on specific nodes?
A. Use qsub -lnodes=compute-1-1+compute-1-2 to submit your job to compute-1-1 and compute-1-2.

Q. How can I run in parallel without using MPI?

Q. How can I see the processes belonging to my job?

The qps command lists the processes belonging to your job. For example,

qps 111008 gives you something like this (where 111008 is the job_id):

[anilth@gardar-1]# qps 111008
*** 111008 shk2 ncpus=12 name=QuantIce
compute-17-1 13:55:14 R    quantice_r4
compute-17-1 00:03:30 S    /bin/bash /home/shk2/bin/watchfolder 10
compute-17-1 18:53:34 R    quantice_r4
compute-17-1 18:53:27 R    quantice_r4
compute-17-1 18:35:19 R    quantice_r4

Q. How can I see the queuing situation?

You can use the showq command.

This command gives you a list of running jobs, idle jobs and blocked jobs. Each line in the list gives you the job ID, the user running the job, the number of CPUs it is using, the time remaining and the start time of the job. The list is sorted by the remaining time of the jobs.

You can also use the qstat command. Its output looks like:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
110142.gardar-adm         229              espenfl         705:51:3 R default
110144.gardar-adm         320              espenfl         702:45:2 R default
110145.gardar-adm         100              espenfl         701:33:0 R default


Where:

- Job id - The identifier of the job in the queueing system.

- Name - The name of the script you gave to qsub.

- User - The job owner.

- Time Use - The CPU time used by the job.

- S - The state of the job: R - Running, Q - Queued (waiting), S - Suspended.

- Queue - The queue where the job is running.
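Since the state is the fifth column of the default qstat listing, a short awk filter can summarise how many jobs are in each state. This is only a sketch: it assumes the column layout shown above with two header lines, and the function name count_states is made up for illustration.

```shell
# Count jobs per state from default `qstat` output read on stdin.
# Assumes two header lines and the state in column 5, as in the listing above.
count_states() {
    awk 'NR > 2 && NF >= 6 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Example using the sample listing from this page instead of a live qstat call.
count_states <<'EOF'
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
110142.gardar-adm         229              espenfl         705:51:3 R default
110144.gardar-adm         320              espenfl         702:45:2 R default
110145.gardar-adm         100              espenfl         701:33:0 R default
EOF
```

On the cluster this would be used as qstat | count_states. Other qstat output formats (for example the one printed with -u) place the state in a different column, so the filter would need adjusting there.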

Q. How do I exclude a node from running a job?
A. Sometimes it is useful to exclude a specific node from running your jobs. This can be due to hardware or software problems on that node. For instance, the node may seem to have problems with the interconnect.

The simplest way to do this is to submit a dummy job to this node:

echo sleep 600 |qsub -lnodes=compute-x-x:ppn=8,walltime=1000

This dummy job occupies the node by sleeping for 600 seconds, so a real job that you submit afterwards will run on other nodes. This will cost you some CPU hours off your quota, but let us know and we will refund them to you later.

Q. I am not able to kill my job. What should I do?

The normal way to kill a job is the qdel <job_id> command. If this does not work, try the following command:

qsig -sNULL <job_id>

This should work most of the time. If it does not, please get in touch with your country's support team.

Q. What is the maximum memory limit for a job?
A. Gardar has 288 nodes with 24 GB of memory each. Therefore, 24 GB is the maximum memory you can request for a job running on a single node. However, a job can run on multiple nodes and use 24 GB of memory on each of them.

Q. How can I view the output of my job?
A. You can use the qpeek command.

The qpeek command can list your job output while it is running. It behaves much like the tail command, so using -f will display the output as it is written:

qpeek -f 109923

See qpeek -h for more info.

You can also submit your job with the -k oe flag. The standard error and standard output of your job will then be put in your home directory. See man qsub for more details.

Q. When will my job start?

You can use the showstart <job_id> command.

This command tells you how many CPUs your job requires and for how long, as well as approximately when the job scheduler thinks the job will start and complete.

Q. Which nodes are busy, and which are idle?

You can use the diagnose or pbsnodes commands to get information about the status of the nodes in Gardar.

The commands can be used as follows:

diagnose -n


pbsnodes -a

Q. What shall I do if my ssh connections are dying or freezing?

You could try adding or adjusting the following lines in your computer's .ssh/config file:

ServerAliveCountMax  3
ServerAliveInterval  10

This configuration applies if you are using OpenSSH. If you are using PuTTY, see its documentation for the equivalent keepalive settings.