NAME

xxqs_name_sxx_diagnostics - xxQS_NAMExx diagnostics documentation

DESCRIPTION

This document describes how to collect diagnostic information for xxQS_NAMExx installations. It is intended for system administrators and support personnel who need to gather information about the xxQS_NAMExx installation and its current state.

Error Codes reported in the failed state of jobs

The failed attribute in the output of both the qstat -j <job_id> and qacct -j <job_id> commands can contain an error code that indicates the reason for a job failure.

Depending on the error code, the job or the queue instance may be set into error state.

The reason for a queue error state can be queried via qstat -explain E.
The error state can be cleared via qmod -cq <queue_name>.

The reason for a job error state can be queried via qstat -j <job_id>.
The error state can be cleared via qmod -cj <job_id>.
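
The query and clear commands above can be collected in a small helper. The following is a minimal Python sketch, not part of xxQS_NAMExx: the function names are hypothetical, and only the command lines quoted above are used. Each function returns argument lists that can be passed to subprocess.run() on a host where the client commands are available.

```python
"""Build the query/clear command lines described above (hypothetical helper names)."""


def queue_error_commands(queue_name: str) -> dict[str, list[str]]:
    """Return the commands to explain and to clear a queue error state."""
    return {
        "explain": ["qstat", "-explain", "E"],
        "clear": ["qmod", "-cq", queue_name],
    }


def job_error_commands(job_id: int) -> dict[str, list[str]]:
    """Return the commands to explain and to clear a job error state."""
    return {
        "explain": ["qstat", "-j", str(job_id)],
        "clear": ["qmod", "-cj", str(job_id)],
    }


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works
    # without a cluster. "all.q" and job 4 are example values only.
    for name, argv in queue_error_commands("all.q").items():
        print(name, " ".join(argv))
    for name, argv in job_error_commands(4).items():
        print(name, " ".join(argv))
```

Clearing an error state does not remove its cause, so it is usually worth reading the explain output before issuing the clear command.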

The following table lists the error codes and their meaning:

Code	Name / Meaning
`0`	`STATUS_OK`: Job ran through and exited normally
`1`	`SSTATE_FAILURE_BEFORE_JOB`: `sge_execd` cannot start the job. The job or queue instances may be set into error state and further information will be available via `qstat -j <job_id>` and/or `qstat -explain E`.
`2`	`ESSTATE_NO_SHEPHERD`: `sge_shepherd` cannot be executed, see `sge_execd` messages file for details.
`3`	`ESSTATE_NO_CONFIG`: `sge_execd` could not write the `sge_shepherd` config file.
`4`	`ESSTATE_NO_PID`: `sge_shepherd` did not write its `pid` file (possibly because it crashed), see the `sge_shepherd` trace file for details.
`5`	`SSTATE_READ_CONFIG`: `sge_shepherd` cannot read its `config` file.
`6`	`SSTATE_PROCSET_NOTSET`: On Solaris: `sge_shepherd` could not create a processor set.
`7`	`SSTATE_BEFORE_PROLOG`: `sge_shepherd` could not start a prolog.
`8`	`SSTATE_PROLOG_FAILED`: A prolog was started by `sge_shepherd` but failed.
`9`	`SSTATE_BEFORE_PESTART`: `sge_shepherd` could not start a PE start procedure.
`10`	`SSTATE_PESTART_FAILED`: A PE start procedure was started by `sge_shepherd` but failed.
`11`	`SSTATE_BEFORE_JOB`: `sge_shepherd` could not start the job. More information can be found in the `sge_shepherd` trace file.
`12`	`SSTATE_BEFORE_PESTOP`: `sge_shepherd` could not start a PE stop procedure.
`13`	`SSTATE_PESTOP_FAILED`: A PE stop procedure was started by `sge_shepherd` but failed.
`14`	`SSTATE_BEFORE_EPILOG`: `sge_shepherd` could not start an epilog.
`15`	`SSTATE_EPILOG_FAILED`: An epilog was started by `sge_shepherd` but failed.
`16`	`SSTATE_PROCSET_NOTFREED`: On Solaris: `sge_shepherd` could not release a previously created processor set.
`17`	`ESSTATE_DIED_THRU_SIGNAL`: The job died through a signal.
`18`	`ESSTATE_SHEPHERD_EXIT`: `sge_shepherd` exited with exit status > 0.
`19`	`ESSTATE_NO_EXITSTATUS`: `sge_shepherd` didn't write its `exit_status` file - possibly crashed before exiting regularly.
`20`	`ESSTATE_UNEXP_ERRORFILE`: The `sge_shepherd` `error` file couldn't be read.
`21`	`ESSTATE_UNKNOWN_JOB`: `sge_execd` got a message from `sge_qmaster` about a job it doesn't know about.
`22`	`ESSTATE_EXECD_LOST_RUNNING`: Job removed manually.
`23`	`ESSTATE_PTF_CANT_GET_PIDS`: PTF can't get information for certain pids.
`24`	`SSTATE_MIGRATE`: The job was checkpointed for migration.
`25`	`SSTATE_AGAIN`: The job shall be re-started.
`26`	`SSTATE_OPEN_OUTPUT`: Error, input, or output file couldn't be opened by `sge_shepherd`.
`27`	`SSTATE_NO_SHELL`: The requested shell could not be found by `sge_shepherd`.
`28`	`SSTATE_NO_CWD`: `sge_shepherd` cannot change directory to the requested job directory.
`29`	`SSTATE_AFS_PROBLEM`: AFS setup failed.
`30`	`SSTATE_APPERROR`: The job exited with exit status 100 (application error).
`36`	`SSTATE_CHECK_DAEMON_CONFIG`: The daemon for an interactive job could not be found (applies if `rsh_daemon`, `rlogin_daemon`, or `qlogin_daemon` is configured with a daemon path instead of `builtin`).
`37`	`SSTATE_QMASTER_ENFORCED_LIMIT`: `sge_qmaster` enforced killing the job due to a limit.
`38`	`SSTATE_ADD_GRP_SET_ERROR`: `sge_shepherd` cannot attach the additional group id to the `sge_shepherd` child process becoming the job.
`100`	`SSTATE_FAILURE_AFTER_JOB`: The job ran through, but no `usage` file was written by `sge_shepherd`.

More details about errors reported by sge_execd can be found in the sge_execd messages file.
For errors reported by sge_shepherd, check the sge_shepherd trace file, the error mail (if requested at job submission), or the administrator mail (if configured in the global configuration).
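
When many jobs have to be inspected, the numeric code can be translated back to its name with a small lookup. The following Python sketch is illustrative only: the function names are hypothetical, the dictionary holds just an excerpt of the table above, and it assumes that the failed value starts with the numeric code, optionally followed by free text.

```python
import re

# Excerpt of the error code table above; extend with the remaining codes as needed.
FAILED_CODES = {
    0: "STATUS_OK",
    1: "SSTATE_FAILURE_BEFORE_JOB",
    8: "SSTATE_PROLOG_FAILED",
    15: "SSTATE_EPILOG_FAILED",
    26: "SSTATE_OPEN_OUTPUT",
    27: "SSTATE_NO_SHELL",
    28: "SSTATE_NO_CWD",
    100: "SSTATE_FAILURE_AFTER_JOB",
}


def parse_failed_code(value: str) -> int:
    """Extract the leading integer from a failed value.

    Assumption: the value starts with the numeric code, optionally
    followed by free text.
    """
    match = re.match(r"\s*(\d+)", value)
    if match is None:
        raise ValueError(f"no numeric error code in {value!r}")
    return int(match.group(1))


def describe_failed(value: str) -> str:
    """Return '<code>: <name>' for a failed value."""
    code = parse_failed_code(value)
    name = FAILED_CODES.get(code, "UNKNOWN (not in this excerpt)")
    return f"{code}: {name}"


if __name__ == "__main__":
    print(describe_failed("0"))
    # The trailing text is a made-up placeholder, not real command output.
    print(describe_failed("27 : example text"))
```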

Administrator Mail

The administrator mail features a mail message for each failed job. The mail message contains:

- general information
  - about the job, e.g., job owner, queue, start time and end time (if available)
  - about the error, e.g., failed in prolog:2025-12-02 16:45:32.632767 [6001:104882]: execvp(/no/such/prolog, "/no/such/prolog") failed: No such file or directory
  - actions taken due to the error, e.g., Job 4 caused action: Queue "all.q@<hostname>" set to ERROR
- the shepherd trace file
- the shepherd error file
- the pe_hostfile

The administrator mail address can be configured in the global configuration (see xxqs_name_sxx_conf(5)).

For most error codes, the administrator mail is sent for every failed job.

For specific configuration-related failures (prolog and epilog configuration), the administrator mail is sent only once, for the first failed job. It is sent again after a configuration change (to a local configuration, the global configuration, or a queue).

For a few error codes, no administrator mail is sent.

The following table lists the error codes and the frequency of sending. See Error Codes reported in the failed state of jobs for a list of the error codes and their meaning.

Code	Frequency
SSTATE_FAILURE_BEFORE_JOB	NEVER
ESSTATE_NO_SHEPHERD	NEVER
ESSTATE_NO_CONFIG	ALWAYS
ESSTATE_NO_PID	ALWAYS
SSTATE_PROCSET_NOTSET	ALWAYS
SSTATE_BEFORE_PROLOG	ALWAYS
SSTATE_PROLOG_FAILED	ONCE
SSTATE_BEFORE_PESTART	ALWAYS
SSTATE_PESTART_FAILED	ALWAYS
SSTATE_BEFORE_JOB	ALWAYS
SSTATE_BEFORE_PESTOP	ALWAYS
SSTATE_PESTOP_FAILED	ALWAYS
SSTATE_BEFORE_EPILOG	ALWAYS
SSTATE_EPILOG_FAILED	ONCE
SSTATE_PROCSET_NOTFREED	ALWAYS
ESSTATE_DIED_THRU_SIGNAL	ALWAYS
ESSTATE_SHEPHERD_EXIT	ALWAYS
ESSTATE_NO_EXITSTATUS	ALWAYS
ESSTATE_UNEXP_ERRORFILE	ALWAYS
ESSTATE_UNKNOWN_JOB	ALWAYS
ESSTATE_EXECD_LOST_RUNNING	NEVER
ESSTATE_PTF_CANT_GET_PIDS	NEVER
SSTATE_MIGRATE	NEVER
SSTATE_AGAIN	NEVER
SSTATE_OPEN_OUTPUT	ALWAYS
SSTATE_NO_SHELL	ALWAYS
SSTATE_NO_CWD	ALWAYS
SSTATE_AFS_PROBLEM	ALWAYS
SSTATE_APPERROR	ALWAYS
SSTATE_CHECK_DAEMON_CONFIG	NEVER
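
The three frequencies can be pictured with a toy model. The following Python sketch is not the sge_execd implementation: the class and method names are hypothetical, names that do not appear in the table are simply treated as ALWAYS, and the "once" state is tracked per error name, which is an assumption.

```python
# Names taken from the frequency table above.
NEVER = {
    "SSTATE_FAILURE_BEFORE_JOB", "ESSTATE_NO_SHEPHERD",
    "ESSTATE_EXECD_LOST_RUNNING", "ESSTATE_PTF_CANT_GET_PIDS",
    "SSTATE_MIGRATE", "SSTATE_AGAIN", "SSTATE_CHECK_DAEMON_CONFIG",
}
ONCE = {"SSTATE_PROLOG_FAILED", "SSTATE_EPILOG_FAILED"}


class AdminMailModel:
    """Toy model of the sending rules; not the sge_execd implementation."""

    def __init__(self) -> None:
        # ONCE errors for which a mail has already been sent.
        self._already_sent: set[str] = set()

    def configuration_changed(self) -> None:
        """A configuration or queue change re-arms the ONCE errors."""
        self._already_sent.clear()

    def should_send(self, error_name: str) -> bool:
        """Decide whether a failed job with this error triggers a mail."""
        if error_name in NEVER:
            return False
        if error_name in ONCE:
            if error_name in self._already_sent:
                return False
            self._already_sent.add(error_name)
            return True
        # Assumption: every other name is treated as ALWAYS in this sketch.
        return True


if __name__ == "__main__":
    model = AdminMailModel()
    print(model.should_send("SSTATE_PROLOG_FAILED"))  # first failure
    print(model.should_send("SSTATE_PROLOG_FAILED"))  # repeated failure
    model.configuration_changed()
    print(model.should_send("SSTATE_PROLOG_FAILED"))  # after a change
```

The model makes one practical consequence visible: after a broken prolog or epilog has been reported once, further failing jobs produce no additional mail until the configuration is changed.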

ENVIRONMENTAL VARIABLES

For a complete list of common environment variables used by all xxQS_NAMExx commands, see xxqs_name_sxx_intro(1).

FILES

The sge_shepherd trace file is located in <sge_shepherd_spool_dir>/active_jobs/<job_id>.<array_task_id>/trace (where <array_task_id> is 1 for non-array jobs).
Set the execd_params attribute KEEP_ACTIVE to keep the active job directories after job termination. See xxqs_name_sxx_conf(5) for details.
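
The trace file location can be assembled from the pattern above. The following Python sketch uses a hypothetical helper name, and the spool directory is a placeholder that has to be replaced with the actual spool directory of the execution host.

```python
from pathlib import PurePosixPath


def trace_file_path(spool_dir: str, job_id: int, array_task_id: int = 1) -> PurePosixPath:
    """Return <spool_dir>/active_jobs/<job_id>.<array_task_id>/trace.

    array_task_id defaults to 1, the value used for non-array jobs.
    """
    return PurePosixPath(spool_dir) / "active_jobs" / f"{job_id}.{array_task_id}" / "trace"


if __name__ == "__main__":
    # "/path/to/spool" is a placeholder, not a default location.
    print(trace_file_path("/path/to/spool", 4))
    print(trace_file_path("/path/to/spool", 4, array_task_id=7))
```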

The sge_execd messages file is located in the sge_execd spool directory.

COPYRIGHT

See xxqs_name_sxx_intro(1) for a full statement of rights and permissions.