Discover Valuable HPC / AI Cluster Insights in 5 Minutes Using qtelemetry (2025-02-16)

Observability in HPC and AI clusters continues to evolve, and we’re thrilled to introduce qtelemetry, now in developer preview. This new tool provides deep insight into your Gridware Cluster Scheduler (GCS) environment in just a few minutes. Best of all, it can even work with legacy Grid Engine clusters when needed.

Introducing qtelemetry

qtelemetry streamlines HPC observability by simplifying integrations with tools like Prometheus and Grafana. Key features include:

• Effortless time series integration: Quickly connect your cluster to Prometheus and Grafana.

• Customizable sample dashboards: Keep an eye on jobs overview, hosts overview, and queue metrics.

• Outlier detection: Instantly flag non-functional hosts or jobs stuck in an error state.

• Resource monitoring: Track host loads including CPU, memory, GPU availability, and custom resources.

• sge_qmaster supervision: Monitor critical daemons (e.g., qmaster CPU/memory usage) and spooling filesystem performance.

• Containerized observability stack: Integrates smoothly with container-based deployments for flexible HPC environments.

If you’re eager to try qtelemetry, simply contact us at HPC Gridware!

Seamless Integration with Prometheus & Grafana

Setting up observability for your HPC or AI clusters has never been simpler. qtelemetry lets you feed GCS metrics directly into Prometheus for comprehensive monitoring:

  1. Connect to Prometheus: Collect key metrics such as host load, memory usage, and GPU availability as Prometheus time series.

  2. Visualize with Grafana: Our sample dashboards offer quick snapshots of job states, host performance, and queue details. Customize them to suit your environment.

  3. Alerting & Outlier Detection: Use Grafana’s alerting capabilities to stay on top of failed jobs, overloaded hosts, or daemon performance issues.
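As a rough illustration of steps 1 and 3, the sketch below builds an instant query against the Prometheus HTTP API and flags hosts whose load exceeds a threshold. It is a minimal sketch under stated assumptions, not part of qtelemetry: the metric name `gcs_host_load_avg`, the `host` label, the server address, and the threshold are hypothetical placeholders. Check your own Prometheus instance for the metric names that are actually exported.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical values: adjust to your own setup.
PROMETHEUS_URL = "http://localhost:9090"  # assumed Prometheus address
LOAD_METRIC = "gcs_host_load_avg"         # assumed metric name, not taken from qtelemetry


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def overloaded_hosts(response, threshold, label="host"):
    """Return (host, value) pairs above the threshold, highest first."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query failed")
    flagged = []
    for sample in response["data"]["result"]:
        # Each vector sample carries its value as [timestamp, "value as string"].
        value = float(sample["value"][1])
        if value > threshold:
            flagged.append((sample["metric"].get(label, "unknown"), value))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


def fetch(url, timeout=10):
    """Fetch and decode a JSON response from Prometheus."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def report(threshold=8.0):
    """Print every host whose (assumed) load metric exceeds the threshold."""
    url = build_query_url(PROMETHEUS_URL, LOAD_METRIC)
    for host, load in overloaded_hosts(fetch(url), threshold):
        print(f"{host}: load {load:.2f}")
```

Calling `report()` against a reachable Prometheus server prints one line per flagged host. In practice the same condition would more likely live in a Grafana alert rule; the script form is only meant to show what such a rule evaluates.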

Sample Dashboard Examples

Our developer preview comes with sample dashboards to give you immediate visibility into essential HPC metrics:

Figure: A cluster-wide summary showing the number of execution hosts, sockets, cores, compute threads, and NVIDIA GPUs available in the cluster, along with an overview of running jobs per node.

Figure: A cluster-wide jobs overview showing the different job states, with the option to group jobs by waiting time so that stuck jobs are easy to spot.

Figure: A cluster queue slots summary, similar to the view familiar from qstat -f.

Figure: CPU load, memory usage, and GPU availability for all hosts.

These dashboards help you quickly identify non-responsive hosts or stuck jobs, so you can keep your HPC cluster running smoothly.
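The idea behind grouping jobs by waiting time can be sketched in a few lines. This is only an illustration and not how the qtelemetry dashboards are implemented: the bucket boundaries, the `(job_id, submit_time)` tuples, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical bucket upper bounds in seconds, with display labels.
BUCKETS = [(3600, "< 1h"), (6 * 3600, "1h-6h"), (24 * 3600, "6h-24h")]
OVERFLOW_LABEL = ">= 24h"


def wait_bucket(wait_seconds):
    """Map a waiting time in seconds to a bucket label."""
    for upper, label in BUCKETS:
        if wait_seconds < upper:
            return label
    return OVERFLOW_LABEL


def group_by_wait(pending_jobs, now):
    """Count pending jobs per waiting-time bucket.

    pending_jobs: iterable of (job_id, submit_time), times in epoch seconds.
    """
    return Counter(wait_bucket(now - submitted) for _, submitted in pending_jobs)
```

A growing count in the longest bucket is the kind of signal that suggests jobs are stuck rather than merely queued.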

Get Started with qtelemetry

Ready to unlock valuable cluster insights in just minutes? The qtelemetry developer preview is now available. Connect your Gridware Cluster Scheduler to Prometheus, load our pre-built Grafana dashboards, and begin monitoring your environment.

For additional information or to arrange a demo, please reach out to us at HPC Gridware. We’re excited to help you gain greater visibility into your HPC and AI workloads.

Happy monitoring!

Daniel

HPC Gridware