Univa Grid Engine supports two levels of checkpointing: the user level
and a transparent level provided by the operating system. User level
checkpointing refers to applications that do their own checkpointing by
writing restart files at certain times or algorithmic steps and by
properly processing these restart files when restarted.
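
As an illustration of user level checkpointing, the following is a
minimal Python sketch, not part of Univa Grid Engine. It assumes a job
whose whole state fits in a small JSON restart file. The names
run_with_restart, write_restart_file and the stop_after parameter (used
only to simulate an abort) are hypothetical.

```python
import json
import os
import tempfile


def write_restart_file(path, state):
    """Write the restart file atomically so an abort never leaves it half written."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    # os.replace swaps the new file into place in a single step.
    os.replace(tmp_path, path)


def run_with_restart(restart_path, total_steps, checkpoint_every=10, stop_after=None):
    """Sum the integers 0..total_steps-1, resuming from a restart file if present.

    stop_after is a hypothetical hook that simulates the job being aborted.
    """
    if os.path.exists(restart_path):
        # A previous run left a restart file behind: continue from there.
        with open(restart_path) as handle:
            state = json.load(handle)
    else:
        state = {"step": 0, "total": 0}

    while state["step"] < total_steps:
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            write_restart_file(restart_path, state)
        if stop_after is not None and state["step"] >= stop_after:
            return None  # simulated abort; work since the last checkpoint is lost

    # The job finished, so the restart file is no longer needed.
    if os.path.exists(restart_path):
        os.remove(restart_path)
    return state["total"]
```

When the same function is called again with the same restart path after
an abort, it repeats only the steps made since the last restart file was
written.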
Transparent checkpointing has to be provided by the operating system
and is usually integrated in the operating system kernel. An example
of a kernel integrated checkpointing facility is the Hibernator
package from Softway for SGI IRIX platforms.
Checkpointing jobs need to be identified to the Univa Grid Engine
system by using the -ckpt option of the qsub(1) command. The argument to
this flag refers to a so-called checkpointing environment, which
defines the attributes of the checkpointing method to be used (see
checkpoint(5) for details). Checkpointing environments are set up by
the qconf(1) options -ackpt, -dckpt, -mckpt and -sckpt. The qsub(1)
option -c can be used to override the when attribute for the
referenced checkpointing environment.
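
The submission described above can be sketched as a small Python helper
that only assembles the qsub(1) argument list and does not run it. The
helper name build_qsub_command, the environment name my_ckpt and the
script name job.sh are hypothetical; only the -ckpt and -c options and
the x occasion specifier are taken from this page.

```python
import shlex


def build_qsub_command(script, ckpt_env, when=None):
    """Return the qsub argument list for submitting script as a checkpointing job."""
    # -ckpt names the checkpointing environment the job refers to.
    argv = ["qsub", "-ckpt", ckpt_env]
    if when is not None:
        # -c overrides the when attribute of that environment.
        argv += ["-c", when]
    argv.append(script)
    return argv


if __name__ == "__main__":
    # Hypothetical names: environment "my_ckpt", job script "job.sh".
    print(shlex.join(build_qsub_command("job.sh", "my_ckpt", when="x")))
    # prints: qsub -ckpt my_ckpt -c x job.sh
```

Returning a list rather than a single string lets the result be passed
to subprocess.run without shell quoting concerns.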
If a queue is of the type CHECKPOINTING, jobs need to have the
checkpointing attribute flagged (see the -ckpt option to qsub(1)) to be
permitted to run in such a queue. Unlike regular batch jobs,
checkpointing jobs are aborted under conditions for which batch or
interactive jobs are suspended or even stay unaffected. These
conditions are:
o Explicit suspension of the queue or job via qmod(1) by the cluster
  administration or a queue owner if the x occasion specifier (see
  qsub(1) -c and checkpoint(5)) was assigned to the job.
o A load average value exceeding the suspend threshold as configured
  for the corresponding queues (see queue_conf(5)).
o Shutdown of the Univa Grid Engine execution daemon sge_execd(8)
  responsible for the checkpointing job.
After abortion, the jobs will migrate to other queues unless they were
submitted to one specific queue by an explicit user request. The
migration of jobs leads to a dynamic load balancing. Note: The
abortion of checkpointed jobs will free all resources (memory, swap
space) which the job occupies at that time. This is in contrast to the
situation for suspended regular jobs, which still occupy swap space.
When a job migrates to a queue on another machine, no files are
currently transferred automatically to that machine. This means that
all files that are used throughout the entire job, including restart
files, executables and scratch files, must be visible or transferred
explicitly
may suffer long turnaround times.
sge_intro(1), qconf(1), qmod(1), qsub(1), checkpoint(5), Univa Grid
Engine Installation and Administration Guide, Univa Grid Engine User's
Guide.
See sge_intro(1) for a full statement of rights and permissions.
UGE 8.0.0 $Date: 2009/06/16 13:58:24 $ SGE_CKPT(1)