
NAME

xxqs_name_sxx_monitoring - xxQS_NAMExx monitoring file format

DESCRIPTION

The monitoring file has been available since xxQS_NAMExx 9.0.0. It is written by the xxQS_NAMExx qmaster and provides metrics about the process and its threads.

Further metrics, e.g. about the cluster state, will be added in future versions.

To have monitoring data written to the monitoring file:

  • set monitoring=true in the reporting_params section of the global configuration
  • set MONITOR_TIME=seconds in the qmaster_params section of the global configuration

FORMAT

The monitoring file is a text file. Each line in the file is one monitoring record, written as a single-line JSON object.

It is located in $SGE_ROOT/$SGE_CELL/common and has the name monitoring.jsonl.

The monitoring file can contain different types of records. The current version provides per thread monitoring.

Every entry in the monitoring file starts with a timestamp (microseconds since epoch) followed by the record type and the schema version, e.g.

{"time":1735032063284912,"type":"worker-thread","version": "1", ... }
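As a minimal sketch of how such records could be consumed, the following Python snippet parses one line and converts the microsecond timestamp into a UTC datetime. The function names parse_record and iter_records are illustrative and not part of xxQS_NAMExx; the sample line is the record header shown above with the elided fields left out.

```python
import json
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def parse_record(line):
    """Parse one monitoring line and add a decoded UTC timestamp (hypothetical helper)."""
    record = json.loads(line)
    # "time" is given in microseconds since epoch; timedelta keeps full precision.
    record["time_utc"] = EPOCH + timedelta(microseconds=record["time"])
    return record

def iter_records(lines, record_type=None):
    """Yield parsed records from any iterable of lines, e.g. an open file."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = parse_record(line)
        if record_type is None or record["type"] == record_type:
            yield record

# Header fields only; a real record carries further fields.
sample = '{"time":1735032063284912,"type":"worker-thread","version":"1"}'
record = parse_record(sample)
print(record["type"], record["version"], record["time_utc"].isoformat())
```

To process the file itself, an open file object for monitoring.jsonl can be passed to iter_records, optionally with a record type such as "worker-thread" to filter on.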

Per Thread monitoring

Per thread monitoring is written for the following thread types:

  • event-master
  • listener
  • mirror
  • timer
  • reader
  • scheduler
  • worker

The monitoring record of every thread contains a common set of metrics. In addition, there are thread type specific metrics, e.g. metrics which are only available for worker threads.

The following metrics are available for every thread in a sub-structure data of the monitoring record:

start_time

The time when the monitoring interval started (microseconds since epoch).

end_time

The time when the monitoring interval ended (microseconds since epoch).

name

Name of the thread (e.g. "reader-01"). The name is unique for every thread. "reader" is the thread type and "01" is a thread serial number.

hostname

The hostname of the machine where the thread (sge_qmaster) is running. In case of a HA set-up with sge_shadowd the host name will change if a failover occurs.

duration

Duration of the reported monitoring interval in microseconds. The monitoring interval is configured in the global configuration, qmaster_params, MONITOR_TIME=seconds.

idle

The percentage of time the thread was idle during the monitoring interval.

wait

The percentage of time the thread was waiting for a lock or a mutex during the monitoring interval.

busy

The percentage of time the thread was busy during the monitoring interval.

requests_in

The number of incoming requests the thread has processed during the monitoring interval.

answers_out

The number of answers the thread has sent out during the monitoring interval.

runs

The number of runs (passes of the thread's main loop) the thread has performed during the monitoring interval.
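The common metrics can be combined into derived values. The sketch below uses the numbers from the worker thread example record at the end of this document; in that record duration equals end_time minus start_time, and idle, wait and busy add up to roughly 100. This is an observation from that single record, not a documented guarantee. The helper names are hypothetical.

```python
def split_thread_name(name):
    """Split e.g. "reader-01" into the thread type "reader" and the serial number 1."""
    thread_type, _, serial = name.rpartition("-")
    return thread_type, int(serial)

def time_breakdown(data):
    """Convert the idle/wait/busy percentages into microseconds of the interval."""
    duration = data["duration"]
    return {key: data[key] / 100.0 * duration for key in ("idle", "wait", "busy")}

def requests_per_second(data):
    """Incoming requests per second over the monitoring interval."""
    return data["requests_in"] / (data["duration"] / 1_000_000)

# Values copied from the worker thread example record in this document.
data = {
    "start_time": 1735914803568156,
    "end_time": 1735914863000676,
    "name": "worker-00",
    "duration": 59432520,
    "idle": 99.99912169297212,
    "wait": 0.0000673032205264054,
    "busy": 0.0008110038073519356,
    "requests_in": 12,
}

parts = time_breakdown(data)
print(split_thread_name(data["name"]))
print({key: round(value) for key, value in parts.items()})
print(requests_per_second(data))
```

Working in microseconds rather than percentages makes very small busy and wait shares easier to compare between threads.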

Thread type specific metrics (extensions)

For some thread types additional metrics are available. They are stored in the extensions sub-structure of the monitoring record.

Extensions for the event master thread

avg_client_count

The average number of clients which are connected to the event master thread over the reported monitoring interval.

mod_client_count

The number of event client modify requests the event master thread has processed during the monitoring interval.

ack_count

The number of event acknowledgements the event master thread has processed during the monitoring interval.

avg_blocked_client_count

The average number of situations where a client was blocked when the event master wanted to send an event to the client.

avg_busy_client_count

The average number of situations where a client was busy when the event master wanted to deliver an event package to the client.

new_event_count

The number of events which were generated by other threads (usually the worker threads) and which the event master processed.

added_event_count

The number of events which were delivered to clients by the event master.
Note that one event generated by a worker thread (see above) can result in multiple events delivered to clients.

skip_event_count

The number of events which were skipped when delivering events to clients because a client had not subscribed to them.

Extensions for listener threads

incoming_gdi

The number of incoming GDI requests the listener thread has processed during the monitoring interval.

incoming_ack

The number of incoming acknowledgements the listener thread has processed during the monitoring interval.

incoming_event_client_exits

The number of incoming event client exits the listener thread has processed during the monitoring interval.

incoming_report

The number of incoming reports (from execution hosts) the listener thread has processed during the monitoring interval.

gdi_get_requests

The number of GDI get requests the listener thread has processed during the monitoring interval.

gdi_trigger_requests

The number of GDI trigger requests the listener thread has processed during the monitoring interval.

gdi_permission_requests

The number of GDI permission requests the listener thread has processed during the monitoring interval.

Extensions for the timer thread

Example of the extensions sub-structure of a timer thread record:

"extensions":{"pending_events":480,"executed_events":48}

pending_events

The number of events which were pending in the timer thread's event queue.

executed_events

The number of events which were executed by the timer thread during the monitoring interval.

Extensions for reader and worker threads

There are three types of extensions for reader and worker threads:

  • execd_reports: reports sent by execution daemons
  • gdi_requests: GDI requests processed by the thread
  • queue_lengths: queue lengths of the internal request queues

execd_reports

Reports from an execution daemon can be

  • load reports
  • job reports
  • configuration reports
  • process reports
  • acknowledgement reports

gdi_requests

The numbers of processed GDI requests are listed for the following request types:

  • add requests
  • get requests
  • modify requests
  • delete requests
  • copy requests
  • trigger requests
  • permission requests

queue_lengths

The queue lengths of the internal request queues are reported for the following queues:

  • writer (contains requests for the worker threads)
  • reader (contains requests for the reader threads)
  • reader_wait (contains requests for the reader threads which will be processed later due to the automatic session management)

Example for a monitoring record from a worker thread (pretty printed):

{
  "time": 1735914863000676,
  "type": "worker-thread",
  "version": "1",
  "data": {
    "start_time": 1735914803568156,
    "end_time": 1735914863000676,
    "name": "worker-00",
    "hostname": "ubuntu-24-amd64-1",
    "duration": 59432520,
    "idle": 99.99912169297212,
    "wait": 0.0000673032205264054,
    "busy": 0.0008110038073519356,
    "requests_in": 12,
    "answers_out": 6,
    "runs": 12,
    "extensions": {
      "execd_reports": {
        "load_reports": 6,
        "job_reports": 0,
        "conf_reports": 6,
        "proc_reports": 0,
        "ack_reports": 0
      },
      "gdi_requests": {
        "add_requests": 6,
        "get_requests": 0,
        "mod_requests": 0,
        "del_requests": 0,
        "cp_requests": 0,
        "trigger_requests": 0,
        "permission_requests": 0
      },
      "queue_lengths": {
        "writer": 1,
        "reader": 0,
        "reader_wait": 0
      }
    }
  }
}
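
Nested records like the one above are often easier to feed into tables or time-series tools when flattened into dotted keys. The following is a minimal sketch under that assumption; flatten is a hypothetical helper, and the embedded record is a trimmed copy of the example that keeps only the extensions.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict with dotted key names."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

# Trimmed copy of the worker thread example record (extensions only).
record = json.loads("""
{"type": "worker-thread", "version": "1",
 "data": {"name": "worker-00",
  "extensions": {
   "execd_reports": {"load_reports": 6, "job_reports": 0, "conf_reports": 6,
                     "proc_reports": 0, "ack_reports": 0},
   "gdi_requests": {"add_requests": 6, "get_requests": 0, "mod_requests": 0,
                    "del_requests": 0, "cp_requests": 0, "trigger_requests": 0,
                    "permission_requests": 0},
   "queue_lengths": {"writer": 1, "reader": 0, "reader_wait": 0}}}}
""")

flat = flatten(record)
# Total number of execd reports counted in this record.
execd_total = sum(record["data"]["extensions"]["execd_reports"].values())

for key in sorted(flat):
    print(key, flat[key])
print("execd reports in interval:", execd_total)
```

Because the extensions differ per thread type, a generic flattening step avoids hard-coding the key names of each thread type.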
Source: open-source master branch — latest development version.