Because this problem occurred relatively frequently, I needed a script that could be run from cron to automatically kill the non-queued processes on the compute nodes.
Download the script here.
As a starting point, I found a Python script developed by David Black-Schaffer for use on the ROCKS Linux Cluster at Stanford (David's Website). That script was written for the SGE queue environment, whereas our group had decided to use the Torque PBS queue system.
Adapting the script required many edits. This was my first exposure to Python, so it took me much longer than it should have. To use the script yourself, you will also need to make edits that depend on your current system setup. To get you started, here are a few of the changes that might need to be made:
- Define the list of priority users. These are users that are exempt from having their processes killed.
- Line 65: modify the re.match command to match the node names for your cluster.
- Line 77: modify the re.match command to match the output from the gstat command on your cluster.
There are many more changes that will need to be made for this script to work on your system.
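To give a rough idea of what the first two kinds of edits look like, here is a minimal sketch. It is not taken from the actual script: the names PRIORITY_USERS, NODE_NAME_RE, is_exempt and is_compute_node are hypothetical, the user names are placeholders, and the pattern assumes ROCKS-style node names such as "compute-0-12". Adjust all of them to your own cluster.

```python
import re

# Hypothetical list of priority users whose processes are never killed.
# Replace these placeholder names with the accounts on your system.
PRIORITY_USERS = {"root", "alice"}

# Assumed node naming scheme: "compute-<rack>-<rank>", e.g. "compute-0-12".
# Change the pattern if your nodes are named differently.
NODE_NAME_RE = re.compile(r"compute-\d+-\d+$")


def is_exempt(user):
    """Return True if the user's processes should be left alone."""
    return user in PRIORITY_USERS


def is_compute_node(hostname):
    """Return True if the hostname looks like one of the compute nodes."""
    # re.match only matches at the start of the string, so names that
    # merely contain "compute-0-12" somewhere in the middle are rejected.
    return NODE_NAME_RE.match(hostname) is not None
```

With these assumptions, is_compute_node("compute-0-12") is true and is_compute_node("frontend") is false, so the head node would be skipped.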
I highly suggest running the script in "DEBUG" mode until you are confident that it runs properly on your system. DEBUG mode will display debug information while the script runs, without actually running the kill commands on the jobs.
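The idea behind a DEBUG switch can be sketched as follows. This is only an illustration, not the script's actual code: the function name kill_remote_process is hypothetical, and it assumes that processes are killed by running "kill -9" on the node over ssh, which may not be how your setup works.

```python
import subprocess

# While DEBUG is True, kill commands are only printed, never executed.
DEBUG = True


def kill_remote_process(node, pid, debug=DEBUG):
    """Kill a process on a compute node, or just report it in debug mode.

    Assumes password-less ssh from the head node to the compute node.
    Returns the command that was (or would have been) run.
    """
    cmd = ["ssh", node, "kill", "-9", str(pid)]
    if debug:
        # Dry run: show what would happen without touching the process.
        print("DEBUG: would run: " + " ".join(cmd))
    else:
        subprocess.call(cmd)
    return cmd
```

Keeping the dry-run check in the one function that issues the kill means the rest of the script behaves identically in both modes, so what you see in DEBUG output is what will be killed later.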
Once the script is working properly, you can add it to cron:
- As root, run "crontab -e".
- Add a cron entry to run the kill script. For example, to run it every day at 12:00:
  - 0 12 * * * /root/scripts/kill_non_queued_processes.py
- Restart cron:
  - /etc/init.d/crond restart