Tracing execd -> qmaster protocol with qping (2013-09-23)

Sometimes it is useful to display the communication between two Grid Engine daemons, for example when load sensor values are not visible in qstat (or wrong values appear).

In order to print out a protocol trace you can connect to the execd with the qping tool, which is shipped with Grid Engine.

First you need to switch to the root account on the execd. Then you have to enable full reporting for qping by setting a specific environment variable:

export SGE_QPING_OUTPUT_FORMAT="s:12 s:13"

This enables columns 12 and 13. See qping -help for more information.

Then you can connect to the execd with the following command:

qping -dump_tag ALL INFO myhostname $SGE_EXECD_PORT execd 1

Don't forget to source the Grid Engine environment variables beforehand (e.g. source $SGE_ROOT/default/common/settings.sh, where $SGE_ROOT is the path to your installation).
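
The steps above can be collected into one small shell function. This is only a sketch: the function name trace_execd is made up for illustration, and it assumes the cell is called "default" and that settings.sh exports SGE_EXECD_PORT, as in the commands shown above.

```shell
# Hypothetical helper (name is illustrative): run the qping trace against one execd.
# The body runs in a subshell, so nothing it sets leaks into the calling shell.
trace_execd() (
    host=${1:?usage: trace_execd <execd-host>}

    # The post recommends running this as root on the execd host; only warn here.
    if [ "$(id -u)" -ne 0 ]; then
        echo "warning: not running as root" >&2
    fi

    # Source the Grid Engine environment (assumes the cell is named "default").
    : "${SGE_ROOT:?SGE_ROOT is not set}"
    . "$SGE_ROOT/default/common/settings.sh"
    : "${SGE_EXECD_PORT:?SGE_EXECD_PORT is not set after sourcing settings.sh}"

    # Enable columns 12 and 13 for qping, as described above.
    SGE_QPING_OUTPUT_FORMAT="s:12 s:13"
    export SGE_QPING_OUTPUT_FORMAT

    qping -dump_tag ALL INFO "$host" "$SGE_EXECD_PORT" execd 1
)
```

It would be called as trace_execd myhostname, with SGE_ROOT already set to the installation path.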

When qping is executed, you will get output like the following:

open connection to "u1010/execd/1" ... no error happened
…
List: <report list> * #Elements: 2
REP_type             (Ulong)     = 1
REP_host             (Host)      = u1010
REP_list             (List)      = full {

List: <No list name specified> * #Elements: 8
-------------------------------
LR_name              (String)    = load_long
LR_value             (String)    = 0.070000
LR_global            (Ulong)     = 0
LR_static            (Ulong)     = 0
LR_host              (Host)      = u1010
-------------------------------
LR_name              (String)    = mem_free
LR_value             (String)    = 388.816406M
LR_global            (Ulong)     = 0
LR_static            (Ulong)     = 0
LR_host              (Host)      = u1010
-------------------------------
LR_name              (String)    = virtual_free
LR_value             (String)    = 779.726562M
LR_global            (Ulong)     = 0
LR_static            (Ulong)     = 0
LR_host              (Host)      = u1010
-------------------------------
LR_name              (String)    = mem_used
LR_value             (String)    = 103.550781M
LR_global            (Ulong)     = 0
LR_static            (Ulong)     = 0
LR_host              (Host)      = u1010
-------------------------------
LR_name              (String)    = virtual_used
LR_value             (String)    = 110.636719M
LR_global            (Ulong)     = 0
LR_static            (Ulong)     = 0
LR_host              (Host)      = u1010
-------------------------------
LR_name              (String)    = m_mem_used
LR_value             (String)    = 101.000000M
LR_global            (Ulong)     = 0
LR_static            (Ulong)     = 0
LR_host              (Host)      = u1010
-------------------------------
LR_name              (String)    = m_mem_free
LR_value             (String)    = 391.000000M
LR_global            (Ulong)     = 0
LR_static            (Ulong)     = 0
LR_host              (Host)      = u1010
-------------------------------

After a while you can see similar output, which is more or less self-describing. You can see the resource name, the value, the host it comes from, and some flags (whether it is a cluster-global value or a static complex).
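
If only the load values are of interest, the dump can be reduced to one line per value with a small awk filter. This is a minimal sketch that assumes the dump looks exactly like the output shown above, with LR_name, LR_value and LR_host on separate lines and LR_host last in each record. The function name extract_load_values is made up for illustration.

```shell
# Hypothetical filter (name is illustrative): read a qping dump on stdin and
# print "host name value" for every load report entry.
extract_load_values() {
    awk '
        $1 == "LR_name"  { name  = $4 }              # resource name, e.g. mem_free
        $1 == "LR_value" { value = $4 }              # reported value, e.g. 388.816406M
        $1 == "LR_host"  { print $4, name, value }   # host closes the record
    '
}
```

The qping command from above can be piped into it (qping ... | extract_load_values). Lines may then appear with some delay, because qping's output might be buffered when it is not written to a terminal.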