Univa Grid Engine 8.1 Features (Part 3) - Simplified Debugging in Case of Failures (2012-05-30)

Another enhancement in Univa Grid Engine 8.1 simplifies debugging when problems occur during job execution. When a job fails on an execution host, you want more details about its execution context on that host. A primary source is usually the active_jobs directory on the execution host, where all configuration files are stored temporarily while the job runs. After the job finishes or fails, its directory is deleted, so you have no chance to go through the files.

In the past you could prevent the deletion of the active_jobs directory by setting keep_active=true in the execd_params on the execution hosts where the jobs were failing (qconf -mconf <hostname>). This kind of debugging had disadvantages: the number of directories kept growing constantly, even when most jobs executed successfully, and you always had to log in to different execution hosts in order to collect the data.

With the new parameter keep_active=error, the active_jobs sub-directory is copied to the master host before it is deleted, but only for jobs which had errors. If you want all directories collected on the qmaster host before they are deleted on the execution hosts, set keep_active=always.
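
As a rough sketch of how these settings might look in practice: the fragment below shows the execd_params entry as it could appear in a host's local configuration. The qconf -sconf call (to display the configuration) and the exact layout of the line are assumptions not taken from this post; check the sge_conf man page of your installation before relying on them. If execd_params already lists other parameters, keep them in the entry rather than replacing them.

```
# Display the local configuration of one execution host (assumed helper step)
qconf -sconf <hostname>

# Open the same configuration in an editor and adjust the execd_params entry
qconf -mconf <hostname>

# Older behaviour: keep every active_jobs directory on the execution host
execd_params   keep_active=true

# New in 8.1: copy the directory to the master host, only for jobs with errors
execd_params   keep_active=error

# New in 8.1: copy the directory to the master host for all jobs
execd_params   keep_active=always
```

Only one of the three execd_params lines would be present at a time; they are listed together here for comparison.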

Update: keep_active=error is now the default after a UGE 8.1 installation.