Introducing Flexible Accounting in Gridware Cluster Scheduler: Collect Arbitrary Job Metrics (2025-02-09)

Ever dreamed of capturing custom metrics for your jobs—like user-generated performance counters or application-specific usage data—directly in your accounting logs? Gridware Cluster Scheduler (and its open source companion, Open Cluster Scheduler) just made it a reality with brand-new “flexible accounting” capabilities.

Why Flexible Accounting Matters

In HPC environments, traditional accounting systems can be limiting. They typically capture CPU time, memory usage, and perhaps GPU consumption, yet your workflow might demand more: model accuracy, network throughput, or other domain-specific metrics. With Gridware’s flexible accounting, you can insert arbitrary fields into the system’s accounting file simply by placing a short epilog script in your queue configuration. Then, whenever you run qacct -j <job_id>, these additional metrics appear neatly alongside standard resource usage.

How It Works

In essence, the cluster scheduler calls an admin-defined epilog after each job completes. This small script (written in Go, Python, or any language you like) appends as many numeric key-value pairs as you wish to the scheduler’s accounting file. For example, you might extract data from your application’s logs (say, images processed or inference accuracy) and push those numbers straight into the accounting system. The Go snippet below demonstrates how easily you can add metrics; here they are random values, so replace them with numbers drawn from your own logic:

package main

import (
    "fmt"

    "github.com/hpc-gridware/go-clusterscheduler/pkg/accounting"
    "golang.org/x/exp/rand"
)

func main() {
    usageFilePath, err := accounting.GetUsageFilePath()
    if err != nil {
        fmt.Printf("Failed to get usage file path: %v\n", err)
        return
    }
    err = accounting.AppendToAccounting(usageFilePath, []accounting.Record{
        {
            AccountingKey:   "tst_random1",
            AccountingValue: rand.Intn(1000),
        },
        {
            AccountingKey:   "tst_random2",
            AccountingValue: rand.Intn(1000),
        },
    })
    if err != nil {
        fmt.Printf("Failed to append to accounting: %v\n", err)
    }
}
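In a real epilog you would derive the values from your application’s output rather than from a random generator. The sketch below is self-contained and illustrative only: the log-line format and the `parseImagesProcessed` helper are hypothetical, not part of the scheduler’s API. In an actual epilog you would feed the parsed number into `accounting.AppendToAccounting` exactly as in the listing above.

```go
package main

import (
    "fmt"
    "strconv"
    "strings"
)

// parseImagesProcessed extracts a numeric metric from a log line such as
// "... images_processed=742" (a hypothetical format at the end of the line).
// In a real epilog, the returned value would become the AccountingValue of
// an accounting.Record appended to the usage file.
func parseImagesProcessed(line string) (int, error) {
    const key = "images_processed="
    idx := strings.Index(line, key)
    if idx < 0 {
        return 0, fmt.Errorf("metric not found in %q", line)
    }
    value := strings.TrimSpace(line[idx+len(key):])
    return strconv.Atoi(value)
}

func main() {
    n, err := parseImagesProcessed("2025-02-09 job done images_processed=742")
    if err != nil {
        fmt.Println("parse error:", err)
        return
    }
    fmt.Println(n) // prints 742
}
```

The parsing is deliberately trivial; the point is only that any number your job writes to a log can end up in qacct via the same two accounting calls shown above.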

Once you have defined your epilog, you need to configure it in the cluster queue configuration (qconf -mq all.q):

epilog sgeadmin@/path/to/flexibleaccountingepilog

Here, sgeadmin is the installation user of Gridware / Open Cluster Scheduler, since it has the permissions required to write to the accounting file.
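To check that the queue picked up the change, you can print the queue configuration and look for the epilog line (a quick sketch; the queue name and path below are the example placeholders from above):

```shell
# Show the configuration of all.q and filter for the epilog entry
qconf -sq all.q | grep epilog
# should list: epilog  sgeadmin@/path/to/flexibleaccountingepilog
```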

Finally, accepting new values in this format needs to be enabled globally in the cluster configuration (qconf -mconf):

reporting_params ...  usage_patterns=test:tst*

Here we allow tst-prefixed values, which are then stored under the "test" section of the internal JSONL accounting file.

That’s all—no core scheduler modifications needed. Run your jobs normally, let them finish, then check out your new columns in qacct.

Unlocking More Insights

This feature is particularly powerful for HPC clusters applying advanced analytics. Need to track per-user image accuracy scores or data ingestion rates? Or capture domain-specific variables for auditing and compliance? Flexible accounting provides a simple, robust mechanism to store all that data consistently.

And remember: Open Cluster Scheduler users get the same advantages, at the cost of a little manual configuration. Because this functionality is unique to Gridware and Open Cluster Scheduler, you won’t find it in legacy Grid Engine variants.

Conclusion

Spend less time mashing logs together and more time exploring richer cluster usage data. Flexible accounting transforms ordinary HPC accounting into a full-blown, customizable metrics infrastructure. Whether you’re fine-tuning AI workflows or verifying compliance, you now have the freedom to store precisely the information you need—right where you expect to see it.