Main Memory Limitation with Grid Engine - A Short Introduction to cgroups

One of my current projects is implementing cgroups support in Univa Grid Engine. It will be available in the next releases (Univa Grid Engine 8.1.7 or 8.2). Control groups are a Linux kernel feature that provides some nice capabilities for better resource management. Since they are part of the Linux kernel, cgroups features can only be used on 64 bit Linux hosts (lx-amd64). This article is a short introduction to one of the features Univa Grid Engine supports: main memory limitation for jobs.

The cgroups subsystems can be enabled in Grid Engine either on cluster-global level or on host-local level (for heterogeneous clusters) in the host configuration. The configuration is opened with qconf -mconf global or qconf -mconf <host-name>. You will notice the following new configuration parameter list:

> qconf -mconf global
 ...
 cgroups_params               cgroup_path=none cpuset=false mount=false \
                              freezer=false killing=false forced_numa=false \
                              h_vmem_limit=false m_mem_free_hard=false \
                              m_mem_free_soft=false min_m_mem_free=0

To enable cgroups, cgroup_path must be set to the path under which the cgroups subsystems are mounted. On RHEL 6 hosts the default is /cgroup, but /sys/fs/cgroup is also a frequently used directory. To enable specific subsystems or control their behavior, the remaining configuration parameters in the Grid Engine host configuration have to be activated.
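
If you are unsure which path applies on a given host, you can check where the kernel has the cgroup subsystems mounted. This is a generic Linux check, not a Grid Engine command, and the output shown is only illustrative:

> grep cgroup /proc/mounts
 cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory 0 0
 cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0

In this case cgroup_path would be /sys/fs/cgroup.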

There are two ways to restrict main memory for a job (the kernel knobs behind both are sketched after the list):

  • hard limitation: The Linux kernel restricts all processes of the job combined to the requested amount of memory. Once the limit is exceeded, further allocations fail or the job is killed by the kernel's out-of-memory handling (the examples below show the latter).

  • soft limitation: The job can also use more memory if it is free on the host - but if the kernel runs out of memory because of other processes, the job is pushed back to the given limit (meaning that the excess memory is swapped out to disk).
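
Under the hood these two modes correspond to the hard and soft limit files of the kernel's memory controller (cgroup v1). Grid Engine manages the cgroups for you automatically; purely as a minimal manual sketch - the group name demo and the mount path are illustrative - the equivalent steps by hand would be:

> mkdir /sys/fs/cgroup/memory/demo
> echo 1G > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes       # hard limit
> echo 1G > /sys/fs/cgroup/memory/demo/memory.soft_limit_in_bytes  # soft limit
> echo $$ > /sys/fs/cgroup/memory/demo/tasks   # add the shell; children inherit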

Hard memory limitation is turned on by setting the parameter m_mem_free_hard=true (soft limitation via m_mem_free_soft=true, respectively). That's it from the Grid Engine configuration point of view! Now when a job requests main memory with m_mem_free and is scheduled to a host with such a configuration, a cgroup in the memory subsystem is created automatically and the job is put into it. The limit can be used for batch jobs, parallel jobs, and interactive jobs. For parallel jobs the limit is set to the requested amount of memory multiplied by the number of slots granted to the job, as expected.
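
A host configuration with hard limits enabled could then look like this (the cgroup_path value is just an example and must match your installation):

> qconf -mconf global
 ...
 cgroups_params               cgroup_path=/sys/fs/cgroup cpuset=false mount=false \
                              freezer=false killing=false forced_numa=false \
                              h_vmem_limit=false m_mem_free_hard=true \
                              m_mem_free_soft=false min_m_mem_free=0

With this in place, a parallel job submitted with, say, qsub -pe <pe-name> 4 -l m_mem_free=1G would run under a 4G cgroup limit on a host where all four slots are granted.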

The following example demonstrates a job that behaves well, i.e. it uses less memory than requested, hence it remains unaffected by the limit. (By the way, memhog is a utility which repeatedly requests a given amount of memory and frees it again.)

> qsub -l h=plantation,m_mem_free=1G -b y memhog -r100 990m 
Your job 4 ("memhog") has been submitted

> qacct -j 4
 ==============================================================
 qname        all.q               
 hostname     plantation          
 ...
 jobname      memhog              
 jobnumber    4                   
 ...
 qsub_time    Sun Aug 25 11:39:20 2013
 start_time   Sun Aug 25 11:39:26 2013
 end_time     Sun Aug 25 11:39:50 2013
 ...
 slots        1                   
 failed       0    
 exit_status  0                   
 ru_wallclock 24           
 ru_utime     22.822       
 ru_stime     0.411        
 ru_maxrss    1014384
 ...             
 maxvmem      1015.188M
 ...

In the next example the job is aborted immediately because it tries to use more main memory than requested (the exit status 137 below is 128 + 9, i.e. the job was killed with SIGKILL):

> qsub -l h=plantation,m_mem_free=1G -b y memhog -r100 1050m
Your job 5 ("memhog") has been submitted

> qacct -j 5
==============================================================
qname        all.q               
hostname     plantation
...         
jobname      memhog              
jobnumber    5                   
qsub_time    Sun Aug 25 11:41:53 2013
start_time   Sun Aug 25 11:41:56 2013
end_time     Sun Aug 25 11:41:57 2013
...
slots        1                   
failed       0    
exit_status  137                 
ru_wallclock 1            
ru_utime     0.504        
ru_stime     0.315        
ru_maxrss    1048452
...             
maxvmem      1.050G
...

And finally an interactive session is shown. The interesting part here is that the interactive session as such is not aborted; only the command started within the session is killed due to the lack of memory. Note that slightly less than the requested 1G is usable by memhog, because the shell and the Grid Engine shepherd process belong to the same cgroup and consume part of the limit (more on that below).

> qrsh -l h=plantation,m_mem_free=1G 
daniel@plantation:~> memhog -r4 1000m
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
daniel@plantation:~> memhog -r4 1050m
.......................................................................................................Killed
daniel@plantation:~> memhog -r4 1024m
.......................................................................................................Killed
daniel@plantation:~> memhog -r4 1023m
.......................................................................................................Killed
daniel@plantation:~> memhog -r4 1020m
......................................................................................................
......................................................................................................
......................................................................................................
......................................................................................................
daniel@plantation:~> exit
logout
>
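
From inside such a session you can inspect the limit the kernel actually enforces. The name of the cgroup directory created for the job depends on the Grid Engine version, so <job-cgroup> below is a placeholder; the memory controller files themselves are standard, and the value shown is what a 1G limit would look like:

daniel@plantation:~> cat /proc/self/cgroup      # shows the cgroup the shell is in
daniel@plantation:~> cat /sys/fs/cgroup/memory/<job-cgroup>/memory.limit_in_bytes
1073741824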

There are two additional settings you should know about. Since the Grid Engine process which takes care of starting and tracking your application (sge_shepherd) is also part of the job and accounted in the memory subsystem, it adds some overhead to the memory footprint of your job. Additionally, users should be prevented from setting unrealistically low limits for their jobs. Both issues are addressed with the min_m_mem_free parameter: if you configure it to, let's say, 250M, each job gets a limit of at least 250M, even if the job requested only 10M. Nevertheless, for job accounting with qstat or qacct only the job's own requests are considered.

The second setting is mount=true. If it is turned on, Univa Grid Engine tries to mount the memory subsystem automatically if it is not already available under <cgroup_path>/memory. On a properly configured host this should already happen during system boot, but setting this parameter guards against such failures (or just simplifies things in some cases).
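
What mount=true automates is roughly the following manual step (the mount point is illustrative and must match cgroup_path):

> mkdir -p /sys/fs/cgroup/memory
> mount -t cgroup -o memory cgroup /sys/fs/cgroup/memory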

I will discuss the other cgroups subsystems used by the upcoming Univa Grid Engine versions in one of the next blog entries.