The Univa Grid Engine Scheduler Configuration: Why artificial load settings matters (2014-11-30)

Grid Engine's scheduler configuration is very flexible. But unfortunately flexibility often comes with interdependencies and hence increases complexity.

In this article I want to have a look at one particular feature of Grid Engine: The artificial load functionality - the job load adjustments.

What’s artificial load?

Load based scheduling is very common, administrators often want to schedule jobs to the least loaded hosts. Hence this is configured in the default scheduler configuration of Grid Engine (man sched_conf -> queue_sort_method).

But what should be considered as "load"? Usually it is the operating system load measurement (man uptime), the 5 min. load average (load_avg) divided by the number of cores (processors) on the host (= np_load_avg). In Grid Engine the load can be set in the load_formula (qconf -msconf) when load based scheduling is turned on in the queue_sort_method (set to load).

As you can see the default load comes from the compute nodes and are reported in larger intervals (configurable in the load_report_time) to the master process and hence the scheduler.

Let's assume you have 48 core machines in your cluster. Each host is allowed to run 48 single-threaded jobs in parallel. The cluster is empty.

Now users are submitting dozens of jobs - what happens in the scheduler?

The scheduler sorts the hosts (queue instances) by the load and starts to schedule the job with the highest priority to the host with the lowest load. Then it takes the second job in priority order to schedule it also the the host with the lowest load. When you look at your watch the scheduling of the second job is around a ms later hence the load situation of the hosts didn’t change (anyhow the scheduler wouldn’t get the information during a scheduling run). What now happens is that the first 48 jobs are filling up the least loaded host.

This is not optimal. In order to approach the problem the Univa Grid Engine scheduler offers an artificial load functionality. Each time the scheduler schedules a job it makes load adjustments to the selected hosts.

The amount and time the artificial load remains on the host is configurable in the job_load_adjustments and load_adjustment_decay_time (man sched_conf).

When np_load_avg=0.5 is set in job_load_adjustments then this means following: Each granted slot for a scheduled job adds an artificial load of 0.5 devided by number of processors of host to np_load_avg. This division is counterintuitive since np_load_avg is already normalized by processors. Nevertheless the bevavior the admin wants to have since the load is finally depended from the machine on which it is scheduled. On a host with 2 processors the same job adds half the load to np_load_avg than on a compute node with 1 processor.

When active the scheduling behavior is like this: Searching least loaded host, trying to accommodate the job, add job_load_adjustments to host load, search least loaded host, …. Now our 48 jobs are distributed around the least loaded hosts of the cluster - excellent!

A connected secret

Univa Grid Engine has another parameter, which is part of the queue configuration, where Grid Engine puts overloaded hosts into an alarm state (man queue_conf): load_thresholds.

When the load + artificial load exceeds the load_threshold then the scheduler will consider this queue instances as overloaded and will not scheduler any jobs to those hosts! Since all this happens in the scheduler it is not visible for you as alarm state (which is a state where a host does not accept any new jobs).

Now let’s configure a load_adjustment of np_load_avg=1 and assume having just one host with 16 slots but 8 cores. In the queue configuration set the load_threshold to 1.0.

After submitting 16 jobs you will see that only 8 jobs are placed on the host - the other jobs a staying pending (and you don’t see any host in alarm state)!

Well, the issue relaxes over time since the load adjustment finally goes away.

But this is really something to have in mind when oversubscribing hosts: Adapt the load_adjustment to your needs! Meaning when having 48 cores und single threaded short jobs you need to drastically reduce the load_adjustment (probably just setting it to 0.01) for np_load_avg. Alternatively you can increase the load_thresholds of your queues. This gives you what you want - optimal placement and usage of your cluster.

To summarize there are two key messages:

  • Artificial load helps you to let the jobs be placed optimally in situations of cluster under utilization.
  • When using artificial load on oversubscribed hosts, please think about your load_threshold settings in the queues depending on the amount of jobs running and the load_adjustment setting.

Try yourself! There are some nice scheduling settings possible with the load threshold. Like if you want to schedule just max. one job to certain hosts in one scheduling run. This you can activate by setting the load adjustment time decay time to a very short time (1 second) and increase the load adjustment to a very large value.