Daniel's Blog

Gridware Cluster Scheduler 9.0.9: Smarter License Management and Enhanced Debugging (2025-11-16)

We're pleased to announce the release of Gridware Cluster Scheduler 9.0.9 (based on Open Cluster Scheduler 9.0.9 - fka Sun Grid Engine), bringing powerful new capabilities for license management, job debugging, and resource efficiency to HPC environments.

FlexLM Integration: Zero-Configuration License Tracking

The headline feature in 9.0.9 is native FlexLM integration that fundamentally simplifies license management in cluster environments. The scheduler now automatically detects, configures, and tracks FlexLM licenses without any manual configuration.

Setup couldn't be simpler—just point the scheduler at your FlexLM server, and it handles the rest. No complex manual license configuration in GCS is needed. The scheduler automatically discovers available licenses, tracks their usage, and ensures jobs requesting licensed software only run when licenses are available. This tight integration prevents the all-too-common scenario of jobs failing mid-execution because licenses became unavailable.

Faulty Job Load Sensor: Automated Debugging Support

Debugging failed jobs just got significantly easier with the new faulty job load sensor. When jobs fail, the scheduler automatically copies all relevant trace files and job environment information to a configurable location.

No more hunting through log directories or reconstructing job environments after the fact. Everything you need for post-mortem analysis is automatically preserved and organized, dramatically reducing the time from job failure to root cause identification.

Dynamic Resource Release for Running Jobs

Version 9.0.9 introduces the ability to drop costly resources—like license requests—from already running jobs. This capability addresses a common inefficiency in HPC environments: jobs that request resources for their entire runtime but only actually use them for a portion of execution.

With dynamic resource release, running jobs can voluntarily give up resources they no longer need, making those resources immediately available to other waiting jobs. For expensive resources like commercial software licenses, this translates directly to reduced per-job costs and significantly improved resource utilization across the cluster.

Enhanced Stability and Scheduling Improvements

Beyond the major features, 9.0.9 includes important stability fixes and scheduling enhancements:

Large Group Support: Critical bug fixes ensure stable operation in environments where users belong to thousands of UNIX groups—a common scenario in large enterprise and academic deployments.

Job Simulation Improvements: Simulated jobs are now ending exactly at the provided runtime despite qmaster restarts in-between.

Parallel Job Scheduling: Fixes for parallel job scheduling logic, particularly when ignoring worker task requests on master hosts, ensure more reliable distributed job execution.

Download and Upgrade

Gridware Cluster Scheduler 9.0.9 and Open Cluster Scheduler 9.0.9 are available for download today at: https://hpc-gridware.com/download-ocs-9-0-9

For existing 9.0.x deployments, upgrading to 9.0.9 follows the standard binary replacement approach—stop services, replace binaries, restart services. No configuration changes are required for basic upgrades, though you'll want to configure the new FlexLM integration and faulty job load sensor to take advantage of these capabilities.

The combination of intelligent license management, automated debugging support, and dynamic resource optimization makes 9.0.9 particularly valuable for environments running commercial applications with expensive licensing models. These features work together to reduce both operational costs and administrative overhead while improving cluster utilization.

To munge or not to munge (2025-11-6)

Gridware Cluster Scheduler supports munge authentication in daily builds since a while, bringing enhanced security to containerized workloads. This widely-adopted service runs as a separate daemon, verifying real UID/GID for cluster tools like qsub and qrsh across multi-node environments.

The security benefits are particularly valuable when allowing user namespaces for containers, ensuring safe operation without authentication compromises. Best of all, installation couldn't be simpler—just add the -munge flag to your installer command:

./inst_sge -munge -m ...

For anyone running containerized HPC workloads, munge support is highly recommended and easy enough that the answer to "to munge or not to munge" is clearly: munge.

Modern CPUs Need Smarter Scheduling (2025-10-24)

Current and future CPUs offer more than just cores and sockets. HPC schedulers require information about L3, L2 caches, cache groups (like AMD's CCX), and power/efficiency cores to allow scenarios like "run only on power cores" or "pack application inside one core complex" to get the best and most stable performance results.

Gridware Cluster Scheduler 9.1.0 introduces topology-aware scheduling that understands these complex hardware layouts. The scheduler uses a simple notation to represent CPU topology:

Letter	Meaning
N	NUMA node
S	Socket
X	L3 cache / Chiplet
Y	L2 cache group
C	Power core
E	Efficiency core
T	Thread (SMT)

Example topology string: NSXCTTCTTCTTCTTEYEEYEE

This represents a socket with power cores (dual-threaded) and efficiency cores (grouped by L2 cache), all sharing an L3 cache—exactly the kind of heterogeneous design found in modern Intel, AMD, and ARM processors.

Check out Ernst Bablick's comprehensive blog post , where the engineer behind this feature explains the implementation with real-world examples from NVIDIA DGX Spark, Intel i9-14900HX, and AMD EPYC Zen5 systems.

Update 1: Please also check the follow up article Compute Nodes with Heterogeneous Topology in Gridware Cluster Scheduler

Update 2: Please also check the second follow up article How Binding Order and Range Shape Job Placement in Gridware Cluster Scheduler

Multi-Node Concepts: From Grid Engine Legacy to the AI Age (2025-09-29)

Grid Engine introduced the parallel job concept to the scheduler domain decades ago, laying the foundational groundwork. In today's AI age, multi-node computations are the essential building blocks that allow us to train, finetune, and run inference at scale.

But the complexity hasn't disappeared—it's just shifted. When you need to understand the precise internal concepts behind robust, scalable multi-node job orchestration in modern environments, please check out my latest post over at hpc-gridware.com.

We take a deep dive into the Gridware Cluster Scheduler / Open Cluster Scheduler machinery that makes distributed computing reliable:

PEs and Allocation Rules: How the Parallel Environment dictates slot distribution (e.g., $fill_up vs. $round_robin) and controls your resource footprint.
The Consumable Logic: A detailed look at how to define and request resources using different consumable scopes (YES, HOST, JOB) to manage everything from memory to licenses.
Controlling the Slaves: The critical role of qrsh -inherit and control_slaves in enforcing per-node resource limits and ensuring complete job cleanup.
RSMAP for Specialized Resources: Managing non-uniform resources like GPUs, network devices, and ports with the powerful RSMAP resource type.

This is the technical knowledge required to move your multi-node jobs from basic execution to optimized, production-grade workflows.

Read the full post on multi-node job concepts: here

SLURM to Open Cluster Scheduler / Gridware Cluster Scheduler Migration Guide (2025-08-04)

Updated 2025-08-18

Migrating to Open Cluster Scheduler (OCS) or Gridware Cluster Scheduler (GCS) is straightforward and provides significant advantages for HPC environments. The schedulers offer sophisticated job prioritization algorithms, exceptional scalability, and robust open source foundations. Most importantly, they maintain SGE CLI/API compatibility for existing workflows while adding modern scheduler capabilities.

Broad Platform Support

OCS and GCS support an extensive range of platforms, making migration feasible regardless of your current infrastructure. From legacy CentOS 7 installations to the latest Ubuntu releases, both schedulers provide consistent functionality. Architecture support spans AMD64 and ARM64, with options for RISC-V and PowerPC, ensuring compatibility across diverse hardware environments including traditional x86 clusters, ARM-based systems, and emerging RISC-V platforms.

Command Migration Reference

Basic Commands

Function	SLURM	OCS/GCS	Notes
Submit job	`sbatch script.sh`	`qsub script.sh`	Direct replacement
Delete job	`scancel 12345`	`qdel 12345`	Direct replacement
Job status	`squeue -u user`	`qstat -u user`	Familiar SGE syntax
Job details	`scontrol show job 12345`	`qstat -j 12345`	Detailed per-job diagnostics
Interactive job	`srun --pty bash`	`qrsh -cwd -V bash`	Recommended parity flags; add resources, e.g. `-pe smp 4 -l h_vmem=4G`
Hold job	`scontrol hold 12345`	`qhold 12345`	Direct equivalent
Release job	`scontrol release 12345`	`qrls 12345`	Direct equivalent
Cluster status	`sinfo`	`qhost`	Summary view; for full attributes use `qhost -F` or `qconf -se`
Queue list	`scontrol show partition`	`qconf -sql`	Simpler command
Node details	`scontrol show node node01`	`qhost -h node01`	For full host config use `qconf -se node01`
Host config	`scontrol show node node01`	`qconf -se node01`	Full host config (complexes, load sensors, consumables)
Accounting	`sacct -j 12345`	`qacct -j 12345`	Post-mortem from accounting file; for live info use `qstat -j` or ARCo if configured

Job Script Conversion

SLURM Directive	OCS/GCS Equivalent	Example
`#SBATCH`	`#$`	Script directive marker
`--job-name=test`	`-N test`	Job naming
`--output=job.out`	`-o job.out`	Standard output
`--error=job.err`	`-e job.err`	Standard error
`--time=24:00:00`	`-l h_rt=24:00:00`	Runtime limit
`--nodes=2`	`-pe mpi 48`	Slots via PE; node count shaped by PE allocation_rule or per-host slot limits
`--ntasks=48`	`-pe mpi 48`	Task/slot count
`--cpus-per-task=4`	`-pe smp 4`	OpenMP/multi-threaded per-task CPUs
`--mem=4000`	`-l h_vmem=4G`	Per-task hard memory limit
`--partition=compute`	`-q compute.q`	Queue selection
`--account=project1`	`-P project1`	Project assignment
`--array=1-100`	`-t 1-100`	Job arrays
`--mail-type=ALL`	`-m bea`	Email notifications
`--mail-user=user@domain`	`-M user@domain`	Email address
`--gres=gpu:2`	`-l gpu=2`	GPU request (requires configured `gpu` complex)

Environment Variables Migration

SLURM Variable	OCS/GCS Variable	Purpose
`$SLURM_JOB_ID`	`$JOB_ID`	Unique job identifier
`$SLURM_JOB_NAME`	`$JOB_NAME`	Job name
`$SLURM_NTASKS`	`$NSLOTS`	Number of allocated slots/cores
`$SLURM_JOB_NODELIST`	`$PE_HOSTFILE`	Path to host/slot file; use `cat "$PE_HOSTFILE"`
`$SLURM_ARRAY_TASK_ID`	`$SGE_TASK_ID`	Array job task ID
`$SLURM_SUBMIT_DIR`	`$SGE_O_WORKDIR`	Submission directory
`$SLURM_JOB_PARTITION`	`$QUEUE`	Queue instance (e.g., `compute.q@node01`)
`$SLURM_JOB_USER`	`$USER`	Job owner (runtime user)
`$SLURM_ARRAY_JOB_ID`	`$JOB_ID`	Array parent job ID
`$SLURM_SUBMIT_HOST`	`$SGE_O_HOST`	Submission host
`$SLURM_CPUS_PER_TASK`	(no direct)	Use `-pe smp N` and `$NSLOTS` within a single-task job
(submitter)	`$SGE_O_LOGNAME`	Submitting user login name

Job Script Examples

Serial Job Migration

SLURM Script

#!/bin/bash
#SBATCH --job-name=serial_job
#SBATCH --output=job.out
#SBATCH --error=job.err
#SBATCH --time=1:00:00
#SBATCH --mem=2000
#SBATCH --ntasks=1

./my_application

OCS/GCS Script

#!/bin/bash
#$ -N serial_job
#$ -cwd
#$ -V
#$ -o job.out
#$ -e job.err
#$ -l h_rt=1:00:00
#$ -l h_vmem=2G

./my_application

Parallel Job Migration

SLURM Script

#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --nodes=2
#SBATCH --ntasks=48
#SBATCH --time=4:00:00
#SBATCH --partition=compute

mpirun ./parallel_app

OCS/GCS Script

#!/bin/bash
#$ -N mpi_job
#$ -cwd
#$ -V
#$ -pe mpi 48
#$ -l h_rt=4:00:00
#$ -q compute.q
# Optional CPU binding if not defined in the PE:
# $ -binding linear:1

# If your MPI is SGE-aware, it will read $PE_HOSTFILE automatically.
# Otherwise, pass the hostfile explicitly:
mpirun -np "$NSLOTS" --hostfile "$PE_HOSTFILE" ./parallel_app

OpenMP / CPUs-per-task Migration

SLURM Script

#!/bin/bash
#SBATCH --job-name=omp_job
#SBATCH --time=2:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=4000

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./omp_app

OCS/GCS Script

#!/bin/bash
#$ -N omp_job
#$ -cwd
#$ -V
#$ -pe smp 8
#$ -l h_rt=2:00:00
#$ -l h_vmem=4G

export OMP_NUM_THREADS="$NSLOTS"

./omp_app

Job Array Migration

SLURM Script

#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --array=1-100
#SBATCH --output=job_%A_%a.out
#SBATCH --time=1:00:00

./process_file "$SLURM_ARRAY_TASK_ID"

OCS/GCS Script

#!/bin/bash
#$ -N array_job
#$ -cwd
#$ -V
#$ -t 1-100
#$ -o job_${JOB_ID}.${SGE_TASK_ID}.out
#$ -l h_rt=1:00:00

./process_file "$SGE_TASK_ID"

Queue Configuration Migration

SLURM Partition	OCS/GCS Queue	Configuration
Partition name	Queue name	Use familiar `.q` suffix
Node assignment	Hostgroup	More flexible node grouping
Resource limits	Queue limits	Comprehensive resource control
Priority settings	Share tree	Hierarchical fair-share
Access control	User lists	Fine-grained permissions

MPI Integrations

Open Cluster Scheduler and Gridware Cluster Scheduler provide comprehensive, native support for major MPI implementations through both tight and loose integration modes. Tight integration ensures parallel tasks are fully accounted for and resource limits are enforced via qrsh -inherit and PE-aware launchers; loose integration provides flexibility for applications that manage their own task distribution.

Schedulers include ready-to-use parallel environment templates and build scripts for Intel MPI, MPICH, MVAPICH, and Open MPI. Each MPI implementation can be installed and configured using simple qconf commands to add parallel environments to queues. For applications that don't natively support SGE integration, an SSH wrapper is provided that transparently converts SSH calls to qrsh -inherit, enabling tight integration without application modifications.

Complete MPI integration templates, build scripts, and example jobs are available in the official repository. This includes everything needed to deploy and test MPI workloads, from basic parallel environment configuration to advanced checkpointing setups.

Gotchas and Nuances

Memory: h_vmem sets a hard per-process limit. mem_free is typically a consumable/availability indicator and may scale with slots depending on complex configuration.
Accounting: qacct reports after jobs finish (reads the accounting file). For near-live info, use qstat -j, the accounting file directly, or ARCo if configured.
Queue variable: $QUEUE contains the queue instance (queue@host). To get just the queue name, use shell parsing like ${QUEUE%@*}.
Node list: There is no direct string nodelist. Use $PE_HOSTFILE (host slots file), $NSLOTS, and if needed count hosts from the file.
Environment and working directory: Add -cwd to run in the submission directory and -V if you rely on the submit-time environment.
Affinity: Use -binding (e.g., -binding linear:1 or striding) or define binding in the PE to mirror Slurm CPU affinity behavior.
Arrays: Use $SGE_TASK_ID. Avoid $TASK_ID as it is not guaranteed across deployments.
Nodes vs tasks: -pe requests slots. To emulate a specific nodes×tasks layout, configure the PE's allocation_rule and per-host slot caps accordingly.

From Sun Grid Engine to Gridware Cluster Scheduler: Why Job Priority Configuration Still Matters (2025-07-25)

The job priority system that was refined over years continues to be one of the most sophisticated features in Open Cluster Scheduler and its fully supported companion, Gridware Cluster Scheduler. Yet it's also one of the most underutilized capabilities in many HPC environments.

After working with countless Grid Engine deployments over the years, and now helping organizations transition to Open Cluster Scheduler and Gridware Cluster Scheduler, I've noticed that while administrators often get comfortable with basic queue configuration, they rarely explore the priority configuration mechanisms that can fundamentally change how their clusters serve organizational needs.

Why This Matters Now

In today's HPC landscape, compute resources aren't just expensive—they're strategic business assets. But it's not just about hardware costs; application license costs can easily exceed hardware expenses, with software licenses often running into hundreds of thousands of dollars or more annually. The difference between an optimally configured priority system and default settings can mean the difference between meeting critical deadlines and missing them entirely—while licensed applications sit idle.

The challenge isn't technical complexity. The real challenge is aligning technical capabilities with business requirements. We've seen environments where expensive GPU clusters and costly application licenses remain underutilized while urgent jobs wait in queue, simply because the priority system wasn't configured to reflect organizational priorities and license economics.

The Evolution Continues

What makes this discussion particularly timely is that good old Sun Grid Engine has found new life as Open Cluster Scheduler, with its fully supported companion, Gridware Cluster Scheduler. This isn't just a rebranding—it's a revitalization of the entire Grid Engine ecosystem for modern computing environments.

The priority system remains at the heart of both platforms, but now it's paired with active development, modern container integration, enhanced GPU support, and enterprise backing that organizations need for critical production deployments.

Dive Deeper

The priority system combines share trees, functional policies, override mechanisms, and urgency calculations through a sophisticated weighting system that creates dynamic fairness while ensuring important projects get the resources they need.

If you're ready to optimize your cluster's priority configuration, I've put together a guide that walks through the mathematical foundations, configuration examples, and real-world implementation strategies: Understanding Job Priority Configuration in Gridware Cluster Scheduler.

Gridware Cluster Scheduler 9.0.7: Enhanced Stability and Performance (2025-07-08)

Release Date: 2025-07-08

We're pleased to announce the release of Gridware Cluster Scheduler 9.0.7, built on Open Cluster Scheduler 9.0.7 (formerly known as "Sun Grid Engine"). This release continues our commitment to delivering reliable, high-performance workload management for HPC environments across diverse computing architectures.

Key Improvements in 9.0.7

Enhanced Stability and Reliability

Version 9.0.7 addresses several important areas to improve system reliability:

Thread Safety Improvements: The accounting and reporting code has been made fully thread-safe, eliminating potential race conditions in high-throughput environments.

Core Binding Fixes: Resolved issues with both striding and explicit core binding strategies that could prevent optimal core allocation even when cores were available.

Error Reporting: Fixed truncation issues in error messages displayed by qstat -j, ensuring administrators receive complete diagnostic information.

Installation Improvements: Addressed installer issues and corrected documentation references to ensure smooth deployment experiences.

Seamless Binary Replacement Upgrades

For existing 9.0.x deployments, upgrading to 9.0.7 remains straightforward with our binary replacement approach:

Stop current services
Replace binaries with 9.0.7 versions
Restart services

No configuration changes or extended maintenance windows are required, making this an ideal upgrade for production environments.

Comprehensive Architecture Support

Open Cluster Scheduler 9.0.7 maintains extensive support for modern computing architectures:

x86-64: Full support across major Linux distributions (RHEL, Rocky, Ubuntu, SUSE)
ARM64: Comprehensive support including NVIDIA Grace Hopper platforms
Specialized Architectures: Support for PowerPC (ppc64le), s390x, and RISC-V platforms
Operating Systems: Linux distributions, FreeBSD, Solaris, and macOS (client tools)

This broad compatibility ensures organizations can deploy consistent workload management across heterogeneous computing environments.

Notable Features from the 9.0.x Series

Since many users may be upgrading from earlier versions, it's worth highlighting key capabilities introduced throughout the 9.0.x series:

qtelemetry (Developer Preview)

Integrated metrics exporter for Prometheus and Grafana, providing detailed cluster monitoring including host metrics, job statistics, and qmaster performance data.

Enhanced NVIDIA GPU Support

The qgpu command simplifies GPU resource management with automatic setup, per-job accounting, and support for Grace Hopper architectures.

MPI Integration Templates

Out-of-the-box support for major MPI distributions (Intel MPI, OpenMPI, MPICH, MVAPICH) with ready-to-use parallel environment configurations.

Advanced Resource Management

RSMAP (Resource Map) complex type for managing specialized resources like GPU devices
Per-host consumable resources
Resource and queue requests per scope for parallel jobs

Performance and Scalability

The 9.0.x series represents significant performance improvements over previous versions through:

Multi-threaded Architecture: Separate thread pools for different request types
Enhanced Data Stores: Multiple data stores reducing internal contention
Automatic Session Management: Ensures data consistency while maintaining performance
Optimized Scheduling: Improved algorithms for large-scale deployments

Continued 9.0.x Support

We remain committed to supporting the entire 9.0.x series with ongoing maintenance, security updates, and technical support. This provides organizations with confidence in their long-term deployment strategy while allowing flexibility in upgrade timing.

Getting Started

Quick Evaluation

For testing Open Cluster Scheduler 9.0.7 (the most feature rich and modern open source "Sun Grid Engine" successor) on major Linux distributions:

# Review the script before running
curl -s https://raw.githubusercontent.com/hpc-gridware/quickinstall/refs/heads/main/ocs.sh | OCS_VERSION=9.0.7 sh

If you are interested in our commercially supported Gridware Cluster Scheduler, please speak with us.

Production Deployment

Production environments should follow our comprehensive installation guide included with the release, ensuring proper configuration for specific requirements and environments.

Resources

Source Code & Documentation: GitHub Repository
Release Notes: Complete technical details and full changelog
Community Support: Active development and user community

Looking Forward

Version 9.0.7 reflects our ongoing dedication to providing robust, high-performance workload management solutions. Whether you're running traditional HPC simulations, modern AI workloads, or mixed computing environments, Gridware Cluster Scheduler delivers the reliability and performance your critical applications require.

The combination of enhanced stability, seamless upgrade paths, and broad architecture support makes 9.0.7 an excellent foundation for both current and future computing needs.

For technical questions or deployment assistance, please connect with our community through GitHub or contact our support team. We're committed to helping you maximize the value of your HPC infrastructure.

Quick & Dirty Open Cluster Scheduler 9.0.5 Install Script (2025-05-04)

Update (July 21, 2025): Newer versions are now available! You can install OCS 9.0.6 or 9.0.7 for testing using:

curl -s https://raw.githubusercontent.com/hpc-gridware/quickinstall/refs/heads/main/ocs.sh | OCS_VERSION=9.0.7 sh

If you want to give Open Cluster Scheduler (OCS) 9.0.5 a quick spin without following the whole doc, I've built a simple shell installer. It's for single-node (qmaster/execd) setups. Feel free to add more execds later.

Heads up:
Don't expect this script to work on every distro or minimal OS install without a hitch. You might hit a missing package, lack of man pages, or a small OS quirk. If you run into trouble, please comment in the gist. If it works, give it a like!

How to quick-try (be sure to review the script first!):

curl -s https://gist.githubusercontent.com/dgruber/c880728f4002bfd6a0d360c7f6a27de1/raw/install_ocs_905.sh | sh
or
wget -O - https://gist.githubusercontent.shcom/dgruber/c880728f4002bfd6a0d360c7f6a27de1/raw/install_ocs_905.sh | sh

Again: Please check the script before you run it.

For a serious, production install (with full details and user setup), refer to the official documentation bundled in the OCS doc packages.

MCP Servers Bring AI Reasoning to HPC Cluster Scheduling (2025-04-18)

The Model Context Protocol (MCP) defines a powerful and simple protocol for AI applications to interact with external tools. Its key benefit is modularity: any tool implementing an MCP server can be plugged into any AI application supporting MCP, allowing for seamless integration of specialized context and even control of external software.

Why is This Useful for HPC?

High Performance Computing (HPC) workload managers—like the venerable Open Cluster Scheduler (formerly Grid Engine)—must accommodate an incredible range of use cases. From desktops running a few sequential jobs, to massive clusters processing millions of jobs daily, requirements and configurations can look dramatically different. Admins often become translators, bridging the gap between complex user requests and the equally complex world of scheduler configurations, with diagnostics (like “why aren't my jobs running?”) rarely having a single, straightforward answer.

MCP Server for Open Cluster Scheduler

I just implemented an example MCP server for Open Cluster Scheduler for research and academic use. It enables an AI, like Claude, to translate high-level, natural language questions into low-level cluster operations and return formatted, accurate, and actionable information, potentially combined with other sources. The AI can answer "Why aren't my jobs running?" with cluster context, analyze configurations, generate job overviews, or spot patterns in job submissions.

Getting Started

The MCP server for Open Cluster Scheduler is easy to try in a containerized, simulated cluster. All code and documentation are available on GitHub.

Quickstart on MacOS with ARM / M chip

Simulate a Cluster

 git clone https://github.com/hpc-gridware/go-clusterscheduler.git
 cd go-clusterscheduler
 make build && make simulate

This launches a container with a fake (but realistic!) test cluster. Test with qhost and qsub.

Launch MCP Server
```
 cd cmd/clusterscheduler-mcp
 go build
 ./clusterscheduler-mcp
```
The MCP server opens port 8888, forwarded to your host.

Connect Your AI

Use npx mcp-remote as a wrapper to connect tools like Claude (details in their docs).

Example config for Claude:

{
  "mcpServers": {
    "gridware": {
      "command": "npx",
      "args": ["mcp-remote", "http://localhost:8888/sse"]
    }
  }
}

Interact
Once connected, use natural language to ask about job status, configuration advice, or request analyses.

Example Use Cases

Diagnose why jobs are not running with AI-aided reasoning.
Generate overviews of current and past jobs in tabular or summary form.
Clone and analyze entire cluster setups, creating a digital twin for testing or support.

Screenshots

Running jobs overview table:

Running Jobs Overview

Why is my job not running?

Conclusion

MCP bridges the gap between complex system tooling and natural language AI assistants, making powerful cluster analysis, debugging, and administration accessible even to non-experts. With a growing ecosystem of thousands of MCP servers, and modern AI’s impressive reasoning abilities, troubleshooting and tuning HPC environments has never been easier. In the world of HPC clusters: Why not building a digital twin of your HPC clusters workload manager for getting getting insights, try changes, and test the impact with your LLMs help? The building blocks are all available.

For more details, see the project documentation. Contributions and feedback are always welcome!

Podcast 2: Gridware Cluster Scheduler & Open Cluster Scheduler 9.0.5 (2025-04-16)

I'm excited to share our latest podcast, which explores the release of Gridware Cluster Scheduler 9.0.5 — built on the new Open Cluster Scheduler 9.0.5! Once again, I turned to NotebookLM to generate a dynamic conversation based entirely on our latest release notes and blog posts. In true Grid Engine tradition, I made sure to double-check everything for accuracy, and the result is a concise, informative episode that captures all the key improvements of this major release.

If you're interested in what's new in 9.0.5, how adopting the Open Cluster Scheduler as our foundation strengthens Gridware, or what this means for current and future users, I think you'll really enjoy this episode.

Listen to the podcast

As always, let us know what you think. It’s inspiring to see how AI can help us tell our story even better—while ensuring the technical details are spot-on. Stay tuned for more updates from the Grid Engine and HPC community!

Blog Post About Efficient NVIDIA GPU Management in HPC and AI Workflows with Gridware Cluster Scheduler (2025-04-13)

Over at HPC Gridware I recently published a blog post highlighting how Gridware Cluster Scheduler (formerly known as "Grid Engine") can significantly simplify GPU management and maximize efficiency in HPC and AI environments.

In the post, we cover exciting new capabilities and improvements, including intelligent scheduling to ensure your valuable GPUs never stay idle, automated GPU setup with simple one-line prolog and epilog scripts, and comprehensive per-job GPU monitoring with detailed accounting metrics. We also walk through integrated support for NVIDIA’s latest ARM-based Grace Hopper and Grace Blackwell platforms, showcasing Gridware’s flexibility for modern hybrid compute clusters with mixed compute architectures.

Additionally, The article provides hands-on examples, such as running GROMACS workloads seamlessly on the new NVIDIA architecture, and integrating NVIDIA containers effortlessly using Enroot. To further improve visibility and operational efficiency, Gridware now supports exporting key GPU metrics to Grafana.

Interested in ensuring your GPUs are always working at full capacity while keeping management complexity at bay? Check out the full blog post on HPC Gridware for all the details!

Nav view search

Navigation

Search

FlexLM Integration: Zero-Configuration License Tracking

Faulty Job Load Sensor: Automated Debugging Support

Dynamic Resource Release for Running Jobs

Enhanced Stability and Scheduling Improvements

Download and Upgrade

Read the full post on multi-node job concepts: here

Broad Platform Support

Command Migration Reference

Basic Commands

Job Script Conversion

Environment Variables Migration

Job Script Examples

Serial Job Migration

Parallel Job Migration

OpenMP / CPUs-per-task Migration

Job Array Migration

Queue Configuration Migration

MPI Integrations

Gotchas and Nuances

Why This Matters Now

The Evolution Continues

Dive Deeper

Key Improvements in 9.0.7

Enhanced Stability and Reliability

Seamless Binary Replacement Upgrades

Comprehensive Architecture Support

Notable Features from the 9.0.x Series

qtelemetry (Developer Preview)

Enhanced NVIDIA GPU Support

MPI Integration Templates

Advanced Resource Management

Performance and Scalability

Continued 9.0.x Support

Getting Started

Quick Evaluation

Production Deployment

Resources

Looking Forward

Why is This Useful for HPC?

MCP Server for Open Cluster Scheduler

Getting Started

Quickstart on MacOS with ARM / M chip

Example Use Cases

Screenshots

Conclusion