
Sunday, April 17, 2011

Cluster Status Check Script

Let’s face it: nodes sometimes go down without warning. If you’re like me, you aren’t going to check the node status consistently throughout the day. There have been cases where a node went down and I didn’t know about it for a week.


This is very problematic, especially when you run a system that has a small number of nodes that are always in use.

To find out about dead nodes without having to check the node status several times a day, I developed a Python script that checks for dead nodes. The script is set to run every day at midnight, and if it finds a dead node it sends me an email.
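The general shape of such a check can be sketched as follows. This is a minimal sketch and not the downloadable script itself: it assumes a Torque cluster where `pbsnodes -l` lists the nodes that are down, offline, or unknown (one node name and state per line), and a mail server listening on `localhost`. The function names and the `sender` and `recipient` parameters are hypothetical.

```python
import smtplib
import subprocess
from email.message import EmailMessage


def parse_down_nodes(pbsnodes_output):
    """Return (node, state) pairs from `pbsnodes -l` style output.

    Assumption: each non-empty line looks like "node03   down".
    """
    nodes = []
    for line in pbsnodes_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            nodes.append((parts[0], parts[1]))
    return nodes


def build_alert(nodes, sender, recipient):
    """Build the notification email listing every problem node."""
    msg = EmailMessage()
    msg["Subject"] = "Cluster alert: %d node(s) down" % len(nodes)
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join("%s: %s" % node for node in nodes))
    return msg


def check_and_notify(sender, recipient):
    """Run the node check and send an email only if something is down."""
    # Assumption: pbsnodes is on the PATH of the user running the job.
    result = subprocess.run(["pbsnodes", "-l"], capture_output=True, text=True)
    nodes = parse_down_nodes(result.stdout)
    if nodes:
        # Assumption: a local SMTP server accepts mail on localhost.
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(build_alert(nodes, sender, recipient))
    return nodes
```

Calling `check_and_notify` with your own addresses from a scheduled job gives the behavior described above. In crontab syntax, the schedule `0 0 * * *` means midnight every day.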



Download the script here.


The only edits you should have to make to get the script to run on your system are on lines 21-35.
  • Line 21: Change to match your username.
  • Line 22: Change to match your email address.
  • Lines 25-35: Change to whatever email message you want sent.
Now all you have to do is add the script to cron.

And you’re done! 

Killing Non-queued Processes Script

When we first set up our group's HPC here at CU Boulder, we ran into issues where processes would hang on the compute nodes. The jobs would no longer be in the queue, but their processes would still be running on the nodes. This made the queued jobs that were supposed to be running on those nodes run much slower.

Because this problem occurred relatively frequently, I needed a script that could be run from cron to automatically kill the non-queued processes on the compute nodes.
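The core idea can be sketched as below. This is a rough sketch under stated assumptions, not the downloadable script: it assumes the process list comes from `ps -eo pid=,uid=,user=` run on a compute node, that the set of users with jobs currently assigned to that node has already been obtained from the queue system, and that regular user accounts start at UID 500. All function names and the `min_uid` parameter are hypothetical.

```python
import os
import signal


def parse_ps(ps_output):
    """Parse `ps -eo pid=,uid=,user=` output into (pid, uid, user) tuples."""
    procs = []
    for line in ps_output.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        procs.append((int(parts[0]), int(parts[1]), parts[2]))
    return procs


def find_stray_pids(processes, users_with_jobs, min_uid=500):
    """Return PIDs owned by regular users who have no job on this node.

    Assumption: UIDs below min_uid belong to system accounts and must
    never be touched.
    """
    return [
        pid
        for pid, uid, user in processes
        if uid >= min_uid and user not in users_with_jobs
    ]


def kill_stray(pids):
    """Send SIGKILL to each stray process, ignoring ones that already exited."""
    for pid in pids:
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process ended between listing and killing
```

Keeping the selection logic in `find_stray_pids` separate from `kill_stray` makes it easy to do a dry run that only prints the PIDs before anything is actually killed.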

Download the script here.


As a starting point for the script, I found a Python script that had been developed by David Black-Schaffer for use on the ROCKS Linux Cluster at Stanford (David's Website). That script was written for the SGE queue environment, whereas our group had decided to use the Torque PBS queue system.