Sunday, April 17, 2011

Killing Non-queued Processes Script

When we first set up our group's HPC cluster here at CU Boulder, we ran into an issue where processes would hang on the compute nodes.  The jobs were no longer in the queue, but their processes were still running on the nodes.  As a result, the queued jobs that were supposed to be running on those nodes ran much slower.

Since this problem occurred fairly frequently, I needed a script that could be run from cron to automatically kill the non-queued processes on the compute nodes.
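To make the idea concrete, here is a minimal sketch of the selection step, and not the actual script: given the processes found on a node and the set of users who currently have a job queued there, pick out the PIDs that belong to nobody in the queue.  The function name, the (pid, user) tuple layout, and the sample data are all assumptions for illustration.

```python
from typing import Iterable, List, Set, Tuple

Process = Tuple[int, str]  # (pid, owner username); hypothetical layout


def find_non_queued(processes: Iterable[Process],
                    queued_users: Set[str],
                    exempt_users: Set[str]) -> List[int]:
    """Return PIDs whose owner has no queued job on this node and is not exempt."""
    return [pid for pid, user in processes
            if user not in queued_users and user not in exempt_users]


# Hypothetical snapshot of one compute node: only alice has a queued job here.
procs = [(101, "alice"), (102, "bob"), (103, "root"), (104, "bob")]
print(find_non_queued(procs, queued_users={"alice"}, exempt_users={"root"}))
# prints [102, 104]
```

Matching by user rather than by job is the simplest rule that works here, but it means a user with one legitimate job on a node also keeps any stale processes on that node.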

Download the script here.


As a starting point, I found a Python script that David Black-Schaffer had developed for the ROCKS Linux cluster at Stanford (David's website).  That script was written for the SGE queue environment, whereas our group had decided to use the Torque PBS queue system.


Adapting the script to our cluster took many edits.  This was my first exposure to Python, so it took me much longer than it should have.  For you to use the script, you will need to make many edits of your own, depending on your system setup.  To get you started, here are a few of the changes that might need to be made:

  • Define the list of priority users.  These are users that are exempt from having their processes killed.
  • Line 65: modify the re.match command to match the node names for your cluster.
  • Line 77: modify the re.match command to match the output from the gstat command on your cluster.
There are many more changes that will need to be made for this script to work on your system.
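As a rough illustration of the kind of edits listed above, the sketch below shows a priority-user list and two regular expressions.  Everything in it is an assumption: the user names, the "compute-0-3" style node names, and the three-column status line are stand-ins, so substitute whatever your cluster and your queue-status command actually produce.

```python
import re

# Hypothetical accounts that are exempt from having their processes killed.
PRIORITY_USERS = {"root", "admin"}

# Assumed node naming scheme, e.g. "compute-0-3". Adjust for your cluster.
NODE_RE = re.compile(r"^compute-\d+-\d+$")

# Assumed status line layout: "<job id> <user> <node>". This is a made-up
# format, not the real output of any queue command.
STATUS_RE = re.compile(r"^(?P<job>\S+)\s+(?P<user>\S+)\s+(?P<node>\S+)$")


def is_compute_node(name):
    """True if the name matches the assumed compute-node pattern."""
    return NODE_RE.match(name) is not None


def users_by_node(status_lines):
    """Map each compute node to the set of users with a job on it."""
    result = {}
    for line in status_lines:
        m = STATUS_RE.match(line.strip())
        if m and is_compute_node(m.group("node")):
            result.setdefault(m.group("node"), set()).add(m.group("user"))
    return result


sample = ["1234.head alice compute-0-1", "1235.head bob compute-0-2", "garbage"]
print(users_by_node(sample))
```

Lines that do not match the pattern are skipped silently, which is convenient for headers but can also hide a wrong regex, so check the parsed result against the real queue before trusting it.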

I strongly recommend running the script in "DEBUG" mode until you are confident that it runs properly on your system.  In DEBUG mode, the script displays debug information as it runs but does not actually run the kill commands.
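A dry-run switch of this kind can be sketched as follows.  This is only an illustration of the pattern, not the script's real code: the DEBUG flag name, the kill_processes function, and the use of os.kill on the local machine are assumptions, and the real script may issue its kill commands differently.

```python
import os
import signal

DEBUG = True  # leave True until the reported PIDs look right


def kill_processes(pids, debug=DEBUG):
    """Kill each PID, or only report what would be killed when debug is set."""
    actions = []
    for pid in pids:
        if debug:
            # Dry run: record the intended action, touch nothing.
            actions.append("DEBUG: would kill %d" % pid)
        else:
            os.kill(pid, signal.SIGKILL)
            actions.append("killed %d" % pid)
    for line in actions:
        print(line)
    return actions


kill_processes([102, 104])  # dry run, because DEBUG is True
```

Keeping the default on the safe side means a forgotten argument results in a report rather than a killed process.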

Once the script is working properly, you can add it to cron:
  • As root, run "crontab -e"
  • Add a cron entry to run the kill script.  For example, to run it every day at 12:00:
    • 0  12  *  *  *    /root/scripts/kill_non_queued_processes.py
  • Restart cron (crontab -e normally installs the new entry on its own, so this is only a precaution)
    • /etc/init.d/crond restart
