Univa Grid Engine 8.1.3 - Faster and Several New Features (2012-11-20)

Version 8.1.3 of Grid Engine is out! It comes with performance improvements in the scheduler, some new features, and of course bug fixes. The most important thing first: in certain situations, namely in big and busy clusters (with more than a hundred compute nodes) running lots of exclusive jobs, the scheduler is now much faster. In (artificial) test scenarios with 2000 simulated hosts and 3000 jobs, the time for a scheduling run went down by a factor of 10-20. In real-world scenarios the performance gain can still be highly visible.

The new features include direct support for the Intel Xeon Phi and general enhancements for making optimal use of such co-processing devices in a Grid Engine managed environment. There is a text UI installer where you can select the resources (like different temperatures, power consumption in watts, etc.) that should be reported by the hosts the devices are installed on. Of course, a configurable load sensor reporting these metrics is included as well. The generic Grid Engine enhancements cover resource map topology masks and per-host consumables.

Automatic Resource Based Core Binding

Since UGE 8.1 the new RSMAP consumable type is available in Univa Grid Engine. An RSMAP or resource map works like an integer consumable (where the amount of a resource type can be controlled), but offers additional functionality: each instance of a resource has an ID (like a simple name or number) which gets attached to the job.

Example: With 4 Phi cards on a host, not only can the amount be managed (as in Sun Grid Engine 6.2u5), but each Phi can also be assigned a specific ID in the complex_values field of the host. Hence complex_values phi=4(oahu-mic0 oahu-mic1 oahu-mic2 oahu-mic3) assigns the phi complex 4 different IDs: oahu-mic0 to oahu-mic3 (qconf -me oahu opens the host configuration for doing that). When submitting a job with qsub -l phi=1, the scheduler assigns a free RSMAP ID to the job. The job can read out the scheduler's decision from the SGE_HGR_phi environment variable (SGE_HGR_<complexname>, where HGR means hard granted resource). Finally the job can start the Phi binary with ssh $SGE_HGR_phi <phi_program> (the Phi device runs a small Linux with its own network address and is usually named <hostname>-mic<micno>). The qstat -j output shows the device the scheduler assigned to the job.
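A minimal job-script sketch of this pattern (the device name and program are illustrative; in a real job SGE_HGR_phi is exported by Grid Engine, the fallback value here only makes the script readable standalone):

```shell
#!/bin/sh
# Grid Engine exports SGE_HGR_phi (SGE_HGR_<complexname>) for the job;
# the fallback below is only for reading this sketch outside a job.
SGE_HGR_phi="${SGE_HGR_phi:-oahu-mic0}"

echo "granted Phi device: $SGE_HGR_phi"

# The Phi runs its own small Linux with its own network address,
# so the binary would be started on the device via ssh, e.g.:
# ssh "$SGE_HGR_phi" ./my_phi_program
```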

When it comes to hybrid jobs, communication often happens between the part of the job running on the host and the part running on the Phi device. Hence the host part should run on those cores of the NUMA host which are near the selected Phi device. This can now be configured by the administrator in a very easy and flexible way: the only thing that needs to be done is attaching a topology mask to the resource IDs in the host configuration. An additional requirement is that the RSMAP is configured as a per-host consumable in the complex configuration (qconf -mc), but more about this in the next section.

Topology masks are configured in the complex_values field of the host configuration where the consumable was initialized. In the example below we have 4 Intel Xeon Phi devices on the host oahu. The first two are near the first socket, the remaining two near the second socket. This information can be found in the mic subdirectory of the sys filesystem. In Univa Grid Engine 8.1.3 the configuration is made in the following way:

qconf -me oahu 
   complex_values phi=4(mic0:SCCCCccccScccccccc mic1:SccccCCCCScccccccc \
   mic2:SccccccccSCCCCcccc mic3:SccccccccSccccCCCC)

Here you can see that 4 resources of the type phi (which must exist in the complex configuration, qconf -mc) are configured. The IDs are mic0 to mic3. For mic0 the first 4 cores of the first socket are available: an upper-case C means that the core can be used, while a lower-case c means that the core is not available for a job which gets mic0 granted. The topology string must be appended to the ID with a colon. Another requirement to make this work is that the RSMAP is defined as a per-HOST consumable (see below).

So when a job is submitted requesting -l phi=1 and it gets mic0 granted, the job is automatically bound to the first 4 cores of the first socket, the second job to the remaining 4 cores of the first socket, and so on (without the qsub having to specify anything other than the phi resource with -l phi=1). When you additionally add a core-binding request on the command line, the scheduler only selects cores from the subset allowed by the topology mask. Hence when more cores are requested than the topology mask grants, the job will not start on this host. If just one core is requested (like qsub -binding linear:1 -l phi=1 ...), then the job is bound to only one core, which must be one of the cores allowed by the topology mask. Of course more Phi devices can be requested; when doing a qsub -l phi=2 ..., the job is bound to the 8 cores available for the 2 Phi cards.
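To make the mask semantics concrete, a quick sketch (using mic0's mask string from the host configuration above) that counts how many cores a device's mask makes available:

```shell
# mic0's mask from the example host configuration: an upper-case C
# marks a usable core, lower-case c a blocked one, S a socket boundary.
mask="SCCCCccccScccccccc"

# Count the usable cores by deleting every character that is not 'C';
# the arithmetic expansion strips any whitespace padding from wc.
allowed=$(( $(printf '%s' "$mask" | tr -cd 'C' | wc -c) ))
echo "mic0 allows $allowed cores"
```

A job granted mic0 can therefore be bound to at most 4 cores on this host.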

Per Host Consumable Complexes

Since version 8.1.3 the RSMAP complex type supports a new value in the consumable field of the complex configuration (qconf -mc): HOST. This is the first consumable type in Grid Engine which offers a per-host resource request out of the box.

The consumable column usually can be set to NO, which means that the complex is not a consumable; this is needed for host or queue specific properties. An RSMAP can't be set to NO because it must always be a consumable. Setting the consumable column to YES means that this complex is accounted per slot: depending on where you initialize the value of the complex (the amount of resources it represents), in the queue configuration (qconf -mq <q> ... complex_values ...) or in the host configuration (qconf -me <hostname/global> ... complex_values ...), the amount is reduced for each job by the number of slots the job uses in the particular container. For example, when you have a parallel job spanning 2 hosts, where the job got 4 slots on host 1 and 2 slots on host 2, and you requested 1 of the particular resource for the job (qsub -l resource=1 -pe roundrobin 6 ...), then the resource on host 1 is reduced by 4 and on host 2 by 2. When the consumable column is set to JOB, the amount of requested resources (in this case 1) is only reduced on the host where the master task runs.
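The accounting difference can be sketched with plain shell arithmetic, using the slot counts from the example above:

```shell
# A 6-slot PE job spans host1 (4 slots) and host2 (2 slots) and
# requests the resource once: qsub -l resource=1 -pe roundrobin 6 ...
request=1
slots_host1=4
slots_host2=2

# consumable=YES: the request is charged per slot on each host.
yes_host1=$((request * slots_host1))
yes_host2=$((request * slots_host2))

# consumable=JOB: the request is charged once, on the master-task host.
job_master=$request

echo "YES: host1 -$yes_host1, host2 -$yes_host2; JOB: master -$job_master"
```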

With the new HOST consumable type the amount of requested resources is decremented on all hosts where the job runs, independent of the number of slots the job got granted. In the example above the resource would be decremented by 1 on host 1 and host 2. This is useful for co-processor devices like the Intel Xeon Phi or GPUs, where you want to specify how many of the co-processors should be used on each host.
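A hypothetical complex entry making phi such a per-host RSMAP consumable is sketched below (the column values are assumptions following the usual qconf -mc column layout), together with the resulting per-host decrement for the 2-host example job:

```shell
# Complex configuration line as it could appear in "qconf -mc"
# (columns: name shortcut type relop requestable consumable default urgency):
#
#   phi  phi  RSMAP  <=  YES  HOST  0  0
#
# For a 2-host parallel job (qsub -l phi=1 -pe roundrobin 6 ...),
# a HOST consumable is charged once per host, regardless of slots:
request=1
host1_charge=$request   # host1: -1 (although the job has 4 slots there)
host2_charge=$request   # host2: -1 (although the job has 2 slots there)
echo "HOST: host1 -$host1_charge, host2 -$host2_charge"
```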