DESCRIPTION
Checkpointing is a facility to save the complete state of an executing
program or job and to restore and restart from this so-called
checkpoint at a later point in time if the original program or job was
halted, e.g. through a system crash.
Univa Grid Engine provides various levels of checkpointing support (see
sge_ckpt(1)). The checkpointing environment described here is a means
to configure the different types of checkpointing in use for your Univa
Grid Engine cluster or parts thereof. For that purpose you can define
the operations which have to be executed in initiating a checkpoint
generation, a migration of a checkpoint to another host or a restart of
a checkpointed application as well as the list of queues which are eli-
gible for a checkpointing method.
Supporting different operating systems may force Univa Grid Engine to
introduce operating system dependencies into the checkpointing
configuration file, and updates of the supported operating system
versions may lead to frequently changing implementation details.
Please refer to the <sge_root>/ckpt directory for more information.
Please use the -ackpt, -dckpt, -mckpt or -sckpt options to the qconf(1)
command to manipulate checkpointing environments from the command-line
or use the corresponding qmon(1) dialogue for X-Windows based interac-
tive configuration.
Note that Univa Grid Engine allows backslashes (\) to be used to escape
newline (\newline) characters. The backslash and the newline are
replaced with a space (" ") character before any interpretation.
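The continuation rule above can be illustrated with a minimal Python
sketch. It is not part of Univa Grid Engine; the function name
join_continuations and the sample command path are hypothetical and
serve only to show the described replacement.

```python
def join_continuations(text: str) -> str:
    """Replace each backslash-newline pair with a single space."""
    return text.replace("\\\n", " ")


# Hypothetical fragment: the script path and its option are invented
# for illustration and do not refer to a real checkpointing tool.
raw = "ckpt_command /usr/local/bin/my_ckpt.sh\\\n--verbose\nwhen sx\n"
joined = join_continuations(raw)
print(joined)
```

After joining, the ckpt_command entry occupies a single logical line,
which is the form on which any further interpretation would operate.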
FORMAT
The format of a checkpoint file is defined as follows:
ckpt_name
The name of the checkpointing environment as defined for ckpt_name in
sge_types(1). To be used in the qsub(1) -ckpt switch or for the
qconf(1) options mentioned above.
interface
The type of checkpointing to be used. Currently, the following types
are valid:
hibernator
The Hibernator kernel level checkpointing is interfaced.
cpr The SGI kernel level checkpointing is used.
cray-ckpt
The Cray kernel level checkpointing is assumed.
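application-level
All interface commands configured in the checkpointing environment
are used, except for the restart_command (see below), which is not
used (even if it is configured); the job script is invoked in case
of a restart instead.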
ckpt_command
A command-line type command string to be executed by Univa Grid Engine
in order to initiate a checkpoint.
migr_command
A command-line type command string to be executed by Univa Grid Engine
during a migration of a checkpointing job from one host to another.
restart_command
A command-line type command string to be executed by Univa Grid Engine
when restarting a previously checkpointed application.
clean_command
A command-line type command string to be executed by Univa Grid Engine
in order to clean up after a checkpointed application has finished.
ckpt_dir
A file system location in which checkpoints, which may be of
considerable size, are to be stored.
ckpt_signal
A Unix signal to be sent to a job by Univa Grid Engine to initiate a
checkpoint generation. The value for this field can either be a sym-
bolic name from the list produced by the -l option of the kill(1) com-
mand or an integer number which must be a valid signal on the systems
used for checkpointing.
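As a rough illustration of the two accepted forms, the following Python
sketch resolves either a symbolic name or an integer against the
signals known on the local host. It assumes a Unix-like system; the
helper name resolve_signal is hypothetical, and the check reflects only
the host on which it runs, not every host used for checkpointing.

```python
import signal


def resolve_signal(value: str) -> signal.Signals:
    """Map a symbolic name or an integer string to a local signal.

    Raises ValueError if the value does not name a valid signal here.
    """
    text = value.strip()
    if text.isdigit():
        # signal.Signals raises ValueError for unknown signal numbers.
        return signal.Signals(int(text))
    name = text.upper()
    if not name.startswith("SIG"):
        # kill -l may print "USR1" rather than "SIGUSR1".
        name = "SIG" + name
    try:
        return signal.Signals[name]
    except KeyError:
        raise ValueError(f"unknown signal name: {value!r}") from None


print(resolve_signal("TERM").name)
print(int(resolve_signal("SIGINT")))
```

Signal numbers differ between operating systems, so a symbolic name is
usually the more portable choice in a mixed cluster.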
when
The points of time when checkpoints are expected to be generated.
Valid values for this parameter are composed of the letters s, m, x and
r in any combination, without any separating character in between. The
same letters are allowed for the -c option of the qsub(1) command,
which overrides the definitions in the checkpointing environment used.
The meaning of the letters is defined as follows:
s A job is checkpointed, aborted and if possible migrated if the
corresponding sge_execd(8) is shut down on the job's machine.
m Checkpoints are generated periodically at the min_cpu_interval
interval defined by the queue (see queue_conf(5)) in which a job
executes.
x A job is checkpointed, aborted and if possible migrated as soon
as the job gets suspended (manually as well as automatically).
r A job will be rescheduled (not checkpointed) when the host on
which the job currently runs enters the unknown state and the
time interval reschedule_unknown defined in the cluster
configuration (see sge_conf(5)) is exceeded.
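The letter combinations described above can be checked with a short
Python sketch. The names WHEN_LETTERS and parse_when are hypothetical,
the one-line summaries merely paraphrase the list above, and the sketch
is not how Univa Grid Engine itself validates the value.

```python
# Paraphrased meanings of the letters accepted for "when".
WHEN_LETTERS = {
    "s": "checkpoint when sge_execd is shut down",
    "m": "checkpoint periodically at min_cpu_interval",
    "x": "checkpoint when the job is suspended",
    "r": "reschedule when the host enters the unknown state",
}


def parse_when(value: str) -> set:
    """Return the set of letters in value, rejecting unknown ones."""
    letters = set(value)
    unknown = sorted(letters - set(WHEN_LETTERS))
    if unknown:
        raise ValueError(f"invalid 'when' letters: {''.join(unknown)}")
    return letters


for letter in sorted(parse_when("sx")):
    print(letter, "-", WHEN_LETTERS[letter])
```

Because the letters are not separated, a value such as sx is read one
character at a time, and the order of the letters does not matter.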
SEE ALSO
sge_intro(1), sge_ckpt(1), sge_types(1), qconf(1), qmod(1), qsub(1),
sge_execd(8).
COPYRIGHT
See sge_intro(1) for a full statement of rights and permissions.
UGE 8.0.0 $Date: 2007/02/14 12:58:39 $ CHECKPOINT(5)