The DRMAA2 Tutorial - Introduction (1) (2013-10-05)

"Evolution is a process of creating patterns of increasing order" (Ray Kurzweil, The Singularity Is Near)

Open standards matter in the software industry. They protect investments, increase the usage of interfaces, decrease costs, bring people with the same objectives together, and so on. With no or only minimal changes, software can support multiple systems, even systems that will be built in the future. Knowledge and recipes are usually widespread, so you will find help and solutions in many different communities. There are certainly many more aspects to open standards, and they play a major role in the exponential growth seen in many areas.

DRMAA2 (Distributed Resource Management Application API 2) is such an open standard. It is the successor of the widespread DRMAA (Distributed Resource Management Application API). DRMAA is generally used for submitting jobs (or creating job workflows) into a compute cluster through a cluster resource management system like Grid Engine (or Condor / PBS / Torque / LSF, …), either by applications (like Mathematica, KNIME, …) or by users who build workflows.

Through its language bindings, DRMAA defines a set of functions for different programming languages. Those functions represent the least common denominator of the functionality offered by cluster schedulers like Grid Engine, PBS, and LSF.

Unlike DRMAA, DRMAA2 lets you not only submit jobs but also retrieve cluster-related information, such as host names, host types and status, or details about the queues configured in the resource management system. You can also monitor jobs that were not submitted from within the application. Overall it covers many more use cases than the old DRMAA standard. When we started several years ago with our kick-off meeting at the supercomputing event in Hamburg, we took the results of a survey of DRMAA users as a starting point. Since then lots of things were added and rearranged.

One of my current projects is implementing DRMAA2 as a Grid Engine API. A beta version of the library will be part of Univa Grid Engine 8.2. The target language of the first implementation is C. Support for other programming languages is planned; these are usually wrappers around the C library (using things like JNI for Java or cgo for Go). This article is the first of a series that I want to publish over the next weeks and months, in which I'm going to introduce the basic usage of the new programming API.

Compiling DRMAA2 Applications for Univa Grid Engine

Compiling DRMAA2 applications for Grid Engine is not much different from compiling DRMAA applications. Like the DRMAA library, the DRMAA2 library is shipped in the $SGE_ROOT/lib/$ARCH directory (where $ARCH is lx-amd64 on 64-bit Linux); the header file is located in the $SGE_ROOT/include directory.

Your DRMAA2 application can then be compiled (provided $SGE_ROOT/default/common/settings.sh is sourced, which is the case when you can run qsub on the command line) with:

gcc -I$SGE_ROOT/include -L$SGE_ROOT/lib/lx-amd64 -o example example.c -ldrmaa2

and be started with

export LD_LIBRARY_PATH=$SGE_ROOT/lib/lx-amd64
./example

If you put libdrmaa2.so into a local library path, you don't need to export LD_LIBRARY_PATH, of course, but you need to take care of updating the library after a Grid Engine update.

A First Look at DRMAA2 Job Sessions

DRMAA2 comes with different types of sessions: job sessions, monitoring sessions, and reservation sessions. While job and monitoring sessions are a mandatory part of each DRMAA2 implementation, the reservation session is optional. Its availability can be discovered at application run-time (which is part of a later tutorial).

The job session is similar to what the old DRMAA offers, with the difference that DRMAA2 sessions are persistent. In Univa Grid Engine the job session names are stored in the central qmaster component. This is particularly useful when you have different processes which are "sharing" the same type of jobs (i.e. each process wants to track a specific set of jobs). Job sessions are user specific, i.e. different users can't share a session. This follows from the rights management of Grid Engine, which disallows operations like suspending, resuming or terminating the jobs of other users. Within a job session you can submit jobs, control jobs, and monitor jobs, but only those jobs which were submitted within that job session can be controlled and monitored there. You can have multiple different job sessions open at the same time in one process, while for the monitoring session only one makes sense.

With a monitoring session you can track the status and online usage of your own jobs, whether they were submitted in any job session, through DRMAA1, or on the command line. As in job sessions, you can also access jobs that finished during your application's run-time. Within a monitoring session you can't submit or control jobs. If the Grid Engine administrative user opens a monitoring session, it gets the jobs of all users of the system. This makes the DRMAA2 API a good candidate for writing job monitoring GUIs.

The following code demonstrates how a new job session is created, opened, closed and destroyed in a DRMAA2 application. Creating means that the session is allocated in the Grid Engine qmaster process; opening means that such a persistent session is made available to the DRMAA2 application. Closing leads to communication with the qmaster so that the library doesn't get any more information about jobs running in this session, and finally destroying is the removal of the job session object on the qmaster. After running the code below, the state on the qmaster is the same as before, since the session was destroyed. What you will not find is the open call, because the session is implicitly opened by the drmaa2_create_jsession() call.
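A minimal sketch of this life cycle in C. It assumes the function names of the published DRMAA2 C binding (drmaa2_create_jsession(), drmaa2_close_jsession(), drmaa2_jsession_free(), drmaa2_destroy_jsession(), drmaa2_lasterror_text()); the signatures in the Univa Grid Engine 8.2 beta may differ. The session name "mysession" is a made-up example, and passing NULL as the contact string is assumed to select the default cluster.

```c
#include <stdio.h>
#include <stdlib.h>
#include "drmaa2.h"

/* Print the last DRMAA2 error text together with a short context string. */
static void report_error(const char *what)
{
    drmaa2_string err = drmaa2_lasterror_text();
    fprintf(stderr, "%s: %s\n", what, err != NULL ? err : "unknown error");
    if (err != NULL) {
        drmaa2_string_free(&err);
    }
}

int main(void)
{
    const char *name = "mysession";   /* hypothetical session name */
    drmaa2_jsession js;

    /* Create: allocates the session on the qmaster and opens it implicitly. */
    js = drmaa2_create_jsession(name, NULL);
    if (js == NULL) {
        report_error("creating the job session failed");
        return EXIT_FAILURE;
    }

    /* ... submit, control and monitor jobs here ... */

    /* Close: the library stops receiving job information for this session. */
    if (drmaa2_close_jsession(js) != DRMAA2_SUCCESS) {
        report_error("closing the job session failed");
    }
    /* Release the local handle; the session itself still exists on the qmaster. */
    drmaa2_jsession_free(&js);

    /* Destroy: remove the persistent session object from the qmaster. */
    if (drmaa2_destroy_jsession(name) != DRMAA2_SUCCESS) {
        report_error("destroying the job session failed");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}
```

Compiling and running this needs the DRMAA2 header and library of a Grid Engine installation plus a reachable qmaster.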

To leave a session persistent, the destroy call can be omitted. It is important to close the session before the application exits, because otherwise the Grid Engine master process keeps the underlying communication connection open for much longer than needed: although there is a timeout, it takes a while until the master process figures out that the client connection died.

Using DRMAA2 lists and dictionaries

The C implementation of DRMAA2 comes with two higher-level data structures: a list and a dictionary. While a dictionary maps a string to another string in an efficient way (which is used for setting resource limits, for example), a list contains a collection of strings, job objects or other DRMAA2-specific data types. They are used to simplify the creation of and access to input and output values for some DRMAA2 functions.

The following code creates a dictionary, adds two different key/value pairs, changes the value of an already known key, retrieves the value of this key, checks whether a specific key is part of the dictionary, deletes it, and finally destroys the dictionary, i.e. frees it. Since the strings are not allocated, the callback method (the second argument) is set to NULL, i.e. nothing needs to be freed when an element or the whole list is deleted.
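A minimal sketch of these dictionary steps, assuming the drmaa2_dict_* functions of the published DRMAA2 C binding. In that binding drmaa2_dict_create() takes the free callback as its only argument, so the sketch passes a single NULL; the beta library may use a different signature. Keys and values are made-up examples.

```c
#include <stdio.h>
#include <stdlib.h>
#include "drmaa2.h"

int main(void)
{
    const char *value;

    /* NULL callback: keys and values are string literals, nothing to free. */
    drmaa2_dict dict = drmaa2_dict_create(NULL);
    if (dict == NULL) {
        fprintf(stderr, "could not create dictionary\n");
        return EXIT_FAILURE;
    }

    /* Add two different key/value pairs (hypothetical names). */
    drmaa2_dict_set(dict, "key1", "value1");
    drmaa2_dict_set(dict, "key2", "value2");

    /* Setting an already known key replaces its value. */
    drmaa2_dict_set(dict, "key1", "newvalue1");

    /* Retrieve the value stored under a key. */
    value = drmaa2_dict_get(dict, "key1");
    printf("key1 -> %s\n", value != NULL ? value : "(not found)");

    /* Check whether a key exists and delete it if so. */
    if (drmaa2_dict_has(dict, "key2") == DRMAA2_TRUE) {
        drmaa2_dict_del(dict, "key2");
    }

    /* Free the dictionary itself. */
    drmaa2_dict_free(&dict);
    return EXIT_SUCCESS;
}
```

If keys or values were allocated with malloc(), a callback that frees them would be passed instead of NULL.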

When creating a list, the type must be given as an argument. In this example a list of strings is created and filled, its length is retrieved, and finally all values are printed.
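A minimal sketch of such a string list, assuming the drmaa2_list_* functions of the published DRMAA2 C binding, where drmaa2_list_create() takes the list type and a free callback (NULL here, since only string literals are stored). The strings are made-up examples.

```c
#include <stdio.h>
#include <stdlib.h>
#include "drmaa2.h"

int main(void)
{
    long i;
    long size;

    /* Create a list of strings; NULL callback because nothing is allocated. */
    drmaa2_string_list strings = drmaa2_list_create(DRMAA2_STRINGLIST, NULL);
    if (strings == NULL) {
        fprintf(stderr, "could not create list\n");
        return EXIT_FAILURE;
    }

    /* Fill the list (hypothetical values). */
    drmaa2_list_add(strings, "first");
    drmaa2_list_add(strings, "second");
    drmaa2_list_add(strings, "third");

    /* Retrieve the length and print every element by position. */
    size = drmaa2_list_size(strings);
    printf("the list has %ld entries\n", size);
    for (i = 0; i < size; i++) {
        const char *s = (const char *)drmaa2_list_get(strings, i);
        printf("%ld: %s\n", i, s != NULL ? s : "(null)");
    }

    /* Free the list itself. */
    drmaa2_list_free(&strings);
    return EXIT_SUCCESS;
}
```

drmaa2_list_get() returns a generic pointer, so the caller casts it to the element type that matches the list type chosen at creation.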

The following list types are defined in DRMAA2:

typedef enum drmaa2_listtype {
   DRMAA2_STRINGLIST       = 0,
   DRMAA2_JOBLIST          = 1,
   DRMAA2_QUEUEINFOLIST    = 2,
   DRMAA2_MACHINEINFOLIST  = 3,
   DRMAA2_SLOTINFOLIST     = 4,
   DRMAA2_RESERVATIONLIST  = 5
} drmaa2_listtype; 

Querying the list of job sessions

All available job sessions can be requested with drmaa2_get_jsession_list(), which returns a list of type DRMAA2_STRINGLIST. This list can simply be processed as explained above. The following code searches for a specific session and opens it if it exists.
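A minimal sketch of that search. One caveat: the published DRMAA2 C binding names the query call drmaa2_get_jsession_names(), and the sketch uses that name; if your library exports drmaa2_get_jsession_list() as mentioned above, substitute it. The session name "mysession" is a made-up example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "drmaa2.h"

int main(void)
{
    const char *wanted = "mysession";   /* hypothetical session name */
    drmaa2_jsession js = NULL;
    drmaa2_string_list names;
    long i;
    long size;

    /* Ask for the names of all job sessions available to this user. */
    names = drmaa2_get_jsession_names();
    if (names == NULL) {
        fprintf(stderr, "could not get the list of job sessions\n");
        return EXIT_FAILURE;
    }

    /* Walk the string list and open the session once its name is found. */
    size = drmaa2_list_size(names);
    for (i = 0; i < size; i++) {
        const char *name = (const char *)drmaa2_list_get(names, i);
        if (name != NULL && strcmp(name, wanted) == 0) {
            js = drmaa2_open_jsession(wanted);
            break;
        }
    }
    drmaa2_list_free(&names);

    if (js == NULL) {
        fprintf(stderr, "session %s not found or could not be opened\n", wanted);
        return EXIT_FAILURE;
    }
    printf("opened job session %s\n", wanted);

    /* Close before exiting so the qmaster can drop the connection early. */
    drmaa2_close_jsession(js);
    drmaa2_jsession_free(&js);
    return EXIT_SUCCESS;
}
```

The session is closed but not destroyed at the end, so it stays available on the qmaster for the next run.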