Grid Engine Unleashed

Quick & Dirty Open Cluster Scheduler 9.0.5 Install Script (2025-05-04)

If you want to give Open Cluster Scheduler (OCS) 9.0.5 a quick spin without following the whole doc, I've built a simple shell installer. It's for single-node (qmaster/execd) setups. Feel free to add more execds later.

Heads up:
Don't expect this script to work on every distro or minimal OS install without a hitch. You might hit a missing package, lack of man pages, or a small OS quirk. If you run into trouble, please comment in the gist. If it works, give it a like!

How to try it quickly (be sure to review the script first!):

curl -s https://gist.githubusercontent.com/dgruber/c880728f4002bfd6a0d360c7f6a27de1/raw/install_ocs_905.sh | sh
or

wget -O - https://gist.githubusercontent.com/dgruber/c880728f4002bfd6a0d360c7f6a27de1/raw/install_ocs_905.sh | sh

Again: Please check the script before you run it.
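
If you prefer not to pipe the script straight into a shell, a minimal download-first variant of the same commands looks like this:

# download the installer, review it, then run it explicitly
curl -fsSL -o install_ocs_905.sh https://gist.githubusercontent.com/dgruber/c880728f4002bfd6a0d360c7f6a27de1/raw/install_ocs_905.sh
less install_ocs_905.sh    # inspect what it will do on your system
sh install_ocs_905.sh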

For a serious, production install (with full details and user setup), refer to the official documentation bundled in the OCS doc packages.

MCP Servers Bring AI Reasoning to HPC Cluster Scheduling (2025-04-18)

The Model Context Protocol (MCP) defines a powerful and simple protocol for AI applications to interact with external tools. Its key benefit is modularity: any tool implementing an MCP server can be plugged into any AI application supporting MCP, allowing for seamless integration of specialized context and even control of external software.

Why is This Useful for HPC?

High Performance Computing (HPC) workload managers—like the venerable Open Cluster Scheduler (formerly Grid Engine)—must accommodate an incredible range of use cases. From desktops running a few sequential jobs, to massive clusters processing millions of jobs daily, requirements and configurations can look dramatically different. Admins often become translators, bridging the gap between complex user requests and the equally complex world of scheduler configurations, with diagnostics (like “why aren't my jobs running?”) rarely having a single, straightforward answer.

MCP Server for Open Cluster Scheduler

I just implemented an example MCP server for Open Cluster Scheduler for research and academic use. It enables an AI, like Claude, to translate high-level, natural language questions into low-level cluster operations and return formatted, accurate, and actionable information, potentially combined with other sources. The AI can answer "Why aren't my jobs running?" with cluster context, analyze configurations, generate job overviews, or spot patterns in job submissions.

Getting Started

The MCP server for Open Cluster Scheduler is easy to try in a containerized, simulated cluster. All code and documentation are available on GitHub.

Quickstart on macOS with an ARM / M-series chip

  1. Simulate a Cluster
     git clone https://github.com/hpc-gridware/go-clusterscheduler.git
     cd go-clusterscheduler
     make build && make simulate
    
    This launches a container with a fake (but realistic!) test cluster. Test with qhost and qsub (a short sanity-check sketch follows after this list).
  2. Launch MCP Server
     cd cmd/clusterscheduler-mcp
     go build
     ./clusterscheduler-mcp
    
    The MCP server opens port 8888, forwarded to your host.
  3. Connect Your AI
    • Use npx mcp-remote as a wrapper to connect tools like Claude (details in their docs).
    • Example config for Claude:
      {
        "mcpServers": {
          "gridware": {
            "command": "npx",
            "args": ["mcp-remote", "http://localhost:8888/sse"]
          }
        }
      }
      
  4. Interact
    Once connected, use natural language to ask about job status, configuration advice, or request analyses.
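
Before wiring up the AI, it's worth confirming that the simulated cluster responds at all. A quick sanity check from a shell inside the simulation environment might look like this (the sleep job is just an illustration; any short command will do):

# inside the simulated cluster environment
qhost                  # list the simulated execution hosts
qsub -b y sleep 60     # submit a trivial binary job
qstat                  # the job should appear as pending or running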

Example Use Cases

  • Diagnose why jobs are not running with AI-aided reasoning.
  • Generate overviews of current and past jobs in tabular or summary form.
  • Clone and analyze entire cluster setups, creating a digital twin for testing or support.

Screenshots

Running jobs overview table:

[Screenshot: Running Jobs Overview]

Why is my job not running?

[Screenshot: AI-assisted answer to "Why is my job not running?"]

Conclusion

MCP bridges the gap between complex system tooling and natural language AI assistants, making powerful cluster analysis, debugging, and administration accessible even to non-experts. With a growing ecosystem of thousands of MCP servers, and modern AI’s impressive reasoning abilities, troubleshooting and tuning HPC environments have never been easier. In the world of HPC clusters: why not build a digital twin of your cluster's workload manager to gain insights, try changes, and test their impact with your LLM's help? The building blocks are all available.


For more details, see the project documentation. Contributions and feedback are always welcome!

Podcast 2: Gridware Cluster Scheduler & Open Cluster Scheduler 9.0.5 (2025-04-16)

I'm excited to share our latest podcast, which explores the release of Gridware Cluster Scheduler 9.0.5 — built on the new Open Cluster Scheduler 9.0.5! Once again, I turned to NotebookLM to generate a dynamic conversation based entirely on our latest release notes and blog posts. In true Grid Engine tradition, I made sure to double-check everything for accuracy, and the result is a concise, informative episode that captures all the key improvements of this major release.

If you're interested in what's new in 9.0.5, how adopting the Open Cluster Scheduler as our foundation strengthens Gridware, or what this means for current and future users, I think you'll really enjoy this episode.

Listen to the podcast

As always, let us know what you think. It’s inspiring to see how AI can help us tell our story even better—while ensuring the technical details are spot-on. Stay tuned for more updates from the Grid Engine and HPC community!

Blog Post About Efficient NVIDIA GPU Management in HPC and AI Workflows with Gridware Cluster Scheduler (2025-04-13)

Over at HPC Gridware I recently published a blog post highlighting how Gridware Cluster Scheduler (formerly known as "Grid Engine") can significantly simplify GPU management and maximize efficiency in HPC and AI environments.

In the post, we cover exciting new capabilities and improvements, including intelligent scheduling to ensure your valuable GPUs never stay idle, automated GPU setup with simple one-line prolog and epilog scripts, and comprehensive per-job GPU monitoring with detailed accounting metrics. We also walk through integrated support for NVIDIA’s latest ARM-based Grace Hopper and Grace Blackwell platforms, showcasing Gridware’s flexibility for modern hybrid compute clusters with mixed compute architectures.

Additionally, the article provides hands-on examples, such as running GROMACS workloads seamlessly on the new NVIDIA architecture and integrating NVIDIA containers effortlessly using Enroot. To further improve visibility and operational efficiency, Gridware now supports exporting key GPU metrics to Grafana.
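
To give a rough feel for the user-facing side (this is an illustration, not an excerpt from the article), requesting a GPU in a Grid Engine style scheduler typically comes down to asking for a GPU resource at submission time; the resource name gpu and the GROMACS command below are placeholders for whatever your cluster actually defines:

# illustrative only: the consumable "gpu" and the GROMACS invocation are placeholders
qsub -l gpu=1 -b y gmx mdrun -deffnm benchmark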

Interested in ensuring your GPUs are always working at full capacity while keeping management complexity at bay? Check out the full blog post on HPC Gridware for all the details!

Questions? Contact me here!