

Showing posts with label System Admin.

Sunday, April 17, 2011

/etc/motd Formatting

It is good practice to announce server maintenance and scheduled downtime many days prior to the actual downtime. The easiest way to ensure that all users become aware of the scheduled downtime is to place an alert in /etc/motd.

For those of you who don't know, /etc/motd is the file whose contents get printed to the screen when a user first logs into the system.

However, on our server (and probably most servers), users tend to ignore the contents of motd since it very rarely changes. To attract the attention of any user who logs into the system, I have begun adding formatting (bold, underline, colors, etc.) to the motd file on our server.

Formatting can be easily added to any motd using ANSI escape codes. For example, below is a copy of a sample maintenance announcement for our server:

[Sample maintenance announcement: the heading "Server Maintenance" rendered in colored, bold, blinking text.]

Notice that the text "Server Maintenance" is what first got your attention. This is because of the different color, bold text, and blinking text (blinking text is not rendered everywhere; in a browser, for instance, it is not supported in IE, Chrome, and Safari).
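If you want to build a banner like this yourself, the sketch below shows the general idea. It is a minimal example, not our actual announcement; the message text, dates, and color are placeholders. ANSI escape sequences (starting with \033[) are written directly into the text, and the terminal interprets them when the motd is printed at login.

    #!/usr/bin/env python
    # Minimal sketch: generate an ANSI-formatted banner for /etc/motd.
    # The message text below is a placeholder, not our actual announcement.

    BOLD  = '\033[1m'
    BLINK = '\033[5m'   # not honored by every terminal
    RED   = '\033[31m'  # any ANSI color code works here
    RESET = '\033[0m'   # return to normal text

    banner = (
        BOLD + BLINK + RED + 'Server Maintenance' + RESET + '\n'
        'The server will be down for maintenance on <date>\n'
        'from <start time> to <end time>.\n'
    )

    # Inspect the output, then (as root) redirect it into the motd:
    #   python make_motd.py > /etc/motd
    print(banner)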

Group Access to All Files

On every cluster, there is most likely a group of users who want the entire group to have read, write, and execute privileges on every file they create. However, changing the permissions on each file individually is a major nuisance.

You can, however, change the default set of permissions for every file and folder that a user creates. This is done using the “umask” command.

For our air-quality modeling group here at CU Boulder, each user’s primary group is set to “henze”. Then, within each user’s .cshrc or .bashrc file, the umask command is used to modify the default permissions.

In our case, we use
        umask 0007
With this umask, every directory the user creates gets permissions 770 (drwxrwx---), and every regular file gets 660 (-rw-rw----), since files are not created with execute permission by default. Either way, the group has full access and other users have none.
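If you want to verify the effect, here is a quick sketch in Python (the file and directory names are just examples):

    #!/usr/bin/env python
    # Quick check of what 'umask 0007' does to newly created files and dirs.
    import os
    import stat

    os.umask(0o007)   # same as 'umask 0007' in the shell

    os.mkdir('demo_dir')   # directories are requested with mode 0777
    fd = os.open('demo_file', os.O_CREAT | os.O_WRONLY, 0o666)
    os.close(fd)           # regular files are requested with mode 0666

    for name in ('demo_dir', 'demo_file'):
        mode = stat.S_IMODE(os.stat(name).st_mode)
        print('%s -> %o' % (name, mode))   # demo_dir -> 770, demo_file -> 660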

This is a simple change that helps keep groups of users happy.

Cluster Status Check Script

Let's face it: nodes sometimes go down randomly, without warning. If you're like me, you aren't going to check the node status consistently throughout the day. There have been cases where a node went down and I didn't know about it for a week.


This is very problematic, especially when you run a system with a small number of nodes that are always in use.

To catch dead nodes without having to check the node status multiple times a day, I developed a Python script that checks for dead nodes. The script runs every day at midnight, and if there is a dead node it sends me an email.
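The full script is linked below; as a rough illustration of the approach, a stripped-down version might look like the following. This sketch assumes a Torque/PBS system, where pbsnodes -l prints one line for each node that is down or offline, and the email address and SMTP host are placeholders:

    #!/usr/bin/env python
    # Stripped-down sketch of the dead-node check (full script linked below).
    # Assumes Torque/PBS: 'pbsnodes -l' lists nodes that are down or offline.
    import smtplib
    import subprocess
    from email.mime.text import MIMEText

    EMAIL = 'admin@example.com'   # placeholder: your email address

    output = subprocess.check_output(['pbsnodes', '-l']).decode()
    if output.strip():            # non-empty output means dead node(s)
        msg = MIMEText('The following nodes appear to be down:\n\n' + output)
        msg['Subject'] = 'Cluster node(s) down'
        msg['From'] = EMAIL
        msg['To'] = EMAIL
        server = smtplib.SMTP('localhost')   # assumes a local mail relay
        server.sendmail(EMAIL, [EMAIL], msg.as_string())
        server.quit()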



Download the script here.


The only edits you should have to make in order to get the script to run on your system are lines 21 - 35.
  • Line 21: Change to match your username.
  • Line 22: Change to match your email address.
  • Lines 25-35: Change to whatever email message you want sent.
Now all you have to do is add the script to cron.
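For example, a crontab entry like the following runs it every night at midnight (the path is a placeholder for wherever you saved the script):

    # run the node check every day at midnight
    0 0 * * * /usr/bin/python /path/to/node_check.py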

And you’re done! 

Killing Non-queued Processes Script

When we first set up our group's HPC cluster here at CU Boulder, we ran into issues where processes would hang on the compute nodes.  The jobs would no longer be in the queue, but the processes would still be running on the nodes.  This made the queued jobs that were supposed to be running on those nodes run much slower.

As this was a problem that occurred relatively frequently, it was necessary to develop a script, run from cron, that automatically kills the non-queued processes on the compute nodes.

Download the script here.


As a starting point for the script, I found a Python script that had been developed by David Black-Schaffer for use on the ROCKS Linux cluster at Stanford (David's website).  That script was written for the SGE queue environment; however, our group had decided to use the Torque PBS queue system, so it had to be adapted.
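As a rough sketch of the Torque version's logic (not the actual script): ask qstat which users currently own jobs, list the user processes running on the node, and kill anything owned by a user with no job. The real script maps jobs to the specific node; the version below simplifies that, and the UID cutoff is an assumption you would tune for your system.

    #!/usr/bin/env python
    # Rough sketch (not the actual script): kill processes on this node that
    # are not owned by root/system accounts or by users with jobs in Torque.
    import os
    import signal
    import subprocess

    MIN_UID = 1000   # assumption: real users start at UID 1000

    # Users who currently have jobs in the queue (qstat's third column).
    job_owners = set()
    for line in subprocess.check_output(['qstat']).decode().splitlines()[2:]:
        fields = line.split()
        if len(fields) >= 3:
            job_owners.add(fields[2])

    # Every process on the node: PID, numeric UID, and owner name.
    ps = subprocess.check_output(['ps', '-eo', 'pid=,uid=,user=']).decode()
    for line in ps.splitlines():
        pid, uid, user = line.split()
        # Skip system processes and users who still have queued/running jobs.
        if int(uid) < MIN_UID or user in job_owners:
            continue
        print('killing orphan process %s owned by %s' % (pid, user))
        os.kill(int(pid), signal.SIGKILL)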