Introducing Flexible Accounting in Gridware Cluster Scheduler: Collect Arbitrary Job Metrics (2025-02-09)
Ever dreamed of capturing custom metrics for your jobs—like user-generated performance counters or application-specific usage data—directly in your accounting logs? Gridware Cluster Scheduler (and its open source companion, Open Cluster Scheduler) just made it a reality with brand-new “flexible accounting” capabilities.
Why Flexible Accounting Matters
In HPC environments, traditional accounting systems can be limiting. They typically capture CPU time, memory usage, perhaps GPU consumption—yet your workflow might demand more: model accuracy, network throughput, or other domain-specific metrics. With Gridware’s flexible accounting, you can insert arbitrary fields into the system’s accounting file simply by placing a short epilog script in your queue configuration. Then, whenever you run qacct -j <job_id>, these additional metrics appear neatly alongside standard resource usage.
How It Works
In essence, the cluster scheduler calls an admin-defined epilog after each job completes. This small script (it can be written in Go, Python, or any language you like) appends as many numeric "key-value" pairs as you wish to the scheduler’s accounting file. For example, you might extract data from your application’s logs (say, images processed or inference accuracy) and push those numbers right into the accounting system. The code snippet below (in Go) demonstrates how easily you can add random metrics—just replace them with values drawn from your own logic:
package main

import (
	"fmt"

	"github.com/hpc-gridware/go-clusterscheduler/pkg/accounting"
	"golang.org/x/exp/rand"
)

func main() {
	// Resolve the path of the usage file for this job.
	usageFilePath, err := accounting.GetUsageFilePath()
	if err != nil {
		fmt.Printf("Failed to get usage file path: %v\n", err)
		return
	}

	// Append two demo metrics; replace the random numbers with values
	// derived from your application's own output.
	err = accounting.AppendToAccounting(usageFilePath, []accounting.Record{
		{
			AccountingKey:   "tst_random1",
			AccountingValue: rand.Intn(1000),
		},
		{
			AccountingKey:   "tst_random2",
			AccountingValue: rand.Intn(1000),
		},
	})
	if err != nil {
		fmt.Printf("Failed to append to accounting: %v\n", err)
	}
}
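Because the epilog can be written in any language, here is a rough Python sketch of the same idea. Note the assumptions: the location of the per-job usage file and the key=value line format are exactly what the Go helper library abstracts away, so both the SGE_JOB_SPOOL_DIR-based path and the format used below are illustrative guesses, and parse_app_metrics() is a hypothetical placeholder for your own log parsing.
#!/usr/bin/env python3
# Illustrative Python epilog sketch; see the assumptions in the text above.
import os
import random

def parse_app_metrics():
    # Hypothetical placeholder: derive numbers from your application's logs.
    return {"tst_random1": random.randint(0, 999), "tst_random2": random.randint(0, 999)}

def main():
    # Assumption: the job's usage file lives in the job spool directory.
    usage_file = os.path.join(os.environ.get("SGE_JOB_SPOOL_DIR", "."), "usage")
    with open(usage_file, "a") as f:
        for key, value in parse_app_metrics().items():
            f.write(f"{key}={value}\n")  # assumed key=value line format

if __name__ == "__main__":
    main()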
Once you have defined your epilog, you need to configure it in the cluster queue configuration (qconf -mq all.q):
epilog sgeadmin@/path/to/flexibleaccountingepilog
Here, sgeadmin is the installation user of Gridware / Open Cluster Scheduler, since it has the required permissions to write to the accounting file.
Finally, accepting the new values in that particular format needs to be enabled globally (qconf -mconf):
reporting_params ... usage_patterns=test:tst*
Here we allow tst-prefixed values, which are then stored in the "test" section of the internal JSONL accounting file.
That’s all—no core scheduler modifications needed. Run your jobs normally, let them finish, then check out your new fields in qacct.
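If you want to post-process those values programmatically, one simple option is to parse the qacct output rather than the internal JSONL file. The sketch below is illustrative (not part of the product): it shells out to qacct -j and collects every field starting with the tst_ prefix configured above, relying on qacct's plain key/value line layout.
import subprocess

def custom_metrics(job_id, prefix="tst_"):
    # Collect the custom, prefix-matched fields from qacct output for one job.
    out = subprocess.run(["qacct", "-j", str(job_id)],
                         capture_output=True, text=True, check=True).stdout
    metrics = {}
    for line in out.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].startswith(prefix):
            metrics[parts[0]] = parts[1].strip()
    return metrics

print(custom_metrics(4711))  # hypothetical job id; prints the epilog-added tst_* values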
Unlocking More Insights
This feature is particularly powerful for HPC clusters applying advanced analytics. Need to track per-user image accuracy scores or data ingestion rates? Or capture domain-specific variables for auditing and compliance? Flexible accounting provides a simple, robust mechanism to store all that data consistently.
And remember: Open Cluster Scheduler users get these same advantages—just expect a little manual configuration. This functionality is unique to Gridware and Open Cluster Scheduler; you won’t find it in other legacy Grid Engine variants.
Conclusion
Spend less time mashing logs together and more time exploring richer cluster usage data. Flexible accounting transforms ordinary HPC accounting into a full-blown, customizable metrics infrastructure. Whether you’re fine-tuning AI workflows or verifying compliance, you now have the freedom to store precisely the information you need—right where you expect to see it.
Running Simple PyTorch Jobs on Gridware Cluster Scheduler: Generating 1000 Cat Images (2025-01-26)
The most valuable digital assets in human history are undoubtedly cat pictures :). While they once dominated the web, today they have become a focal point for AI-generated imagery. So why not demonstrate the Gridware Cluster Scheduler by generating 1000 high-resolution cat images?
Key components include:
- Gridware Cluster Scheduler installation
- Optional use of Open Cluster Scheduler
- Python
- PyTorch
- Stable Diffusion
- GPUs
System Setup
After installing the Gridware Cluster Scheduler and enabling GPU support according to the Admin Guide, NVIDIA_GPUS resources become automatically available. The GPU integration sets the NVIDIA-related environment variables selected by the scheduler for each job and ensures accurate per-job GPU accounting.
When using the free Open Cluster Scheduler, additional manual configuration is necessary (see this blog post).
In this example, I assume PyTorch and the required Python libraries are available on the nodes. To minimize machine dependencies, containers can be used, although this is not the focus of the blog. HPC-compatible container runtime environments such as Apptainer (formerly Singularity) and Podman can be used with the Gridware Cluster Scheduler out-of-the-box.
Submitting Cat Image Creation Jobs
To create cat images, we need input prompts, lots of them. They can be stored in an array like this:
prompts = [
    "A playful cat stretches its ears while rubbing against soft fur...",
    "A curious cat leaps gracefully to catch a tiny mouse...",
    # ... (other prompts)
]
To generate many prompts, we can automate their creation:
import random

def generate_pro_prompts():
    cameras = [
        "Canon EOS R5", "Sony α1", "Nikon Z9", "Phase One XT",
        "Hasselblad X2D", "Fujifilm GFX100 II", "Leica M11",
        "Pentax 645Z", "Panasonic S1R", "Olympus OM-1"
    ]
    lenses = [
        "85mm f/1.2", "24-70mm f/2.8", "100mm Macro f/2.8",
        "400mm f/2.8", "50mm f/1.0", "12-24mm f/4",
        "135mm f/1.8", "Tilt-Shift 24mm f/3.5", "8-15mm Fisheye",
        "70-200mm f/2.8"
    ]
    styles = [
        "award-winning wildlife", "editorial cover", "cinematic still",
        "fine art gallery", "commercial product", "documentary",
        "fashion editorial", "scientific macro", "sports action",
        "architectural interior"
    ]
    lighting = [
        "golden hour backlight", "softbox Rembrandt", "dappled forest",
        "blue hour ambient", "studio butterfly", "silhouette contrast",
        "LED ring light", "candlelit warm", "neon urban", "moonlit"
    ]
    cats = [
        "Maine Coon", "Siamese", "Bengal", "Sphynx", "Ragdoll",
        "British Shorthair", "Abyssinian", "Persian", "Scottish Fold",
        "Norwegian Forest"
    ]
    actions = [
        "mid-leap", "grooming", "playing", "sleeping", "stretching",
        "climbing", "hunting", "yawning", "curious gaze", "pouncing"
    ]

    # Generate random combinations
    prompts = []
    for _ in range(1000):
        style = random.choice(styles)
        cat = random.choice(cats)
        action = random.choice(actions)
        camera = random.choice(cameras)
        lens = random.choice(lenses)
        light = random.choice(lighting)
        aperture = round(1.4 + random.random() * 7.6, 1)  # random f-stop between f/1.4 and f/9
        shutter_denominator = random.randint(1, 4000)     # random shutter speed denominator, e.g. 1/1500s
        iso = random.choice([100, 200, 400, 800, 1600, 3200, 6400])  # random ISO

        prompt = (f"{style} photo of {cat} cat {action} | "
                  f"{camera} with {lens} | {light} lighting | "
                  f"f/{aperture} 1/{shutter_denominator}s ISO {iso} | "
                  f"Technical excellence award composition")
        prompts.append(prompt)
    return prompts
Once we have an array of prompts, we can divide them into chunks so that each batch job generates more than one image. This approach can be further optimized later (not part of this article).
The critical aspect here is job submission. The prompts are supplied to the job as environment variables using the -v switch. The -q switch assigns the gpu.q queue to the job, which is assumed to be configured across the GPU nodes. The -l switch requests 1 GPU device per job, ensuring the GPU integration sets the appropriate NVIDIA environment variables so that jobs don’t conflict. This is accomplished through the new qgpu utility called in the gpu.q prolog. For the Open Cluster Scheduler, you need to configure this manually: configure the NVIDIA_GPUS resource as an RSMAP with the GPU ID range, and have the job itself convert the SGE_HGR_NVIDIA_GPUS environment variable set in its environment into the NVIDIA environment variables (see this blog post).
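For the Open Cluster Scheduler case, that conversion can be done in a few lines at the top of the Python job itself. The snippet below is a minimal sketch assuming the granted GPU IDs arrive as a space-separated list in SGE_HGR_NVIDIA_GPUS; with the Gridware Cluster Scheduler and qgpu this step is not needed.
import os

# Map the granted RSMAP GPU IDs (e.g. "0 2") to CUDA_VISIBLE_DEVICES before
# any CUDA library is initialized (Open Cluster Scheduler setups only).
granted_gpus = os.environ.get("SGE_HGR_NVIDIA_GPUS", "")
if granted_gpus:
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(granted_gpus.split())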
The job itself, executed on the compute node, is a python3 script located on shared storage.
Here's the submit.py script:
import argparse
import json
import subprocess

# generate_pro_prompts() is defined above in the same file.

def main():
    prompts = generate_pro_prompts()

    parser = argparse.ArgumentParser()
    parser.add_argument('--chunk-size', type=int, required=True,
                        help='Number of prompts per job')
    args = parser.parse_args()

    # Split prompts into chunks
    chunks = [prompts[i:i+args.chunk_size]
              for i in range(0, len(prompts), args.chunk_size)]

    # Submit jobs
    for i, chunk in enumerate(chunks):
        try:
            # Serialize chunk to JSON for safe transmission
            prompts_json = json.dumps(chunk)
            subprocess.run(
                [
                    "qsub", "-j", "y", "-b", "y",
                    "-v", f"INPUT_PROMPT={prompts_json}",
                    "-q", "gpu.q",
                    "-l", "NVIDIA_GPUS=1",
                    "python3", "/home/nvidia/genai1/run.py"
                ],
                check=True
            )
            print(f"Submitted job {i+1} with {len(chunk)} prompts")
        except subprocess.CalledProcessError as e:
            print(f"Failed to submit job {i+1}. Error: {e}")

if __name__ == "__main__":
    main()
The run.py script carries out the compute-intensive work using the cat-related prompts. Note that there is plenty of room for obvious improvements, which are not the focus of this article. It retrieves prompts from the INPUT_PROMPT environment variable together with the unique JOB_ID assigned by the Gridware Cluster Scheduler, exactly as in SGE. The images are stored in the job's working directory, which is assumed to be shared. Learn more about the DiffusionPipeline utilizing Stable Diffusion at HuggingFace.
import logging
import os
import json

from diffusers import DiffusionPipeline
import torch
from PIL import Image

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def generate_image(prompt: str, output_path: str):
    try:
        logging.info("Loading model...")
        pipe = DiffusionPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            use_safetensors=True,
            variant="fp16"
        )
        pipe.to("cuda")
    except Exception as e:
        logging.error("Pipeline init error: %s", str(e))
        return

    try:
        logging.info("Generating: %s", prompt)
        image = pipe(prompt=prompt).images[0]
        image.save(output_path)
        logging.info("Saved: %s", output_path)
    except Exception as e:
        logging.error("Generation error: %s", str(e))

if __name__ == "__main__":
    # Get job context
    job_id = os.getenv("JOB_ID", "unknown_job")

    # Parse prompts from JSON
    try:
        prompt_list = json.loads(os.getenv("INPUT_PROMPT", "[]"))
    except json.JSONDecodeError:
        logging.error("Invalid prompt format!")
        prompt_list = []

    # Process all prompts in chunk
    for idx, prompt in enumerate(prompt_list):
        if not prompt.strip():
            continue  # skip empty prompts
        output_path = f"image_{job_id}_{idx}.png"
        generate_image(prompt, output_path)
To submit the AI inference jobs to the system, simply execute:
python3 submit.py --chunk-size 5
This results in 200 queued jobs, each capable of creating 5 images.
Supervising AI Job Execution
When correctly configured, the Gridware Cluster Scheduler executes jobs on available GPUs at the appropriate time. You can check the status with qstat or get job-related information with qstat -j <jobid>. After a while, you will have your 1000 cat images. Once completed, you can also view per-job GPU usage with qacct -j <jobid>, including metrics like nvidia_energy_consumed and nvidia_power_usage_avg, as well as the submission command line with the prompts, for example:
qacct -j 2900
==============================================================
qname gpu.q
hostname XXXXXX.server.com
group nvidia
owner nvidia
project NONE
department defaultdepartment
jobname python3
jobnumber 2900
taskid undefined
pe_taskid NONE
account sge
priority 0
qsub_time 2025-01-26 11:02:12.650259
submit_cmd_line qsub -j y -b y -v 'INPUT_PROMPT=["award-winning wildlife photo of Maine Coon cat mid-leap | Canon EOS R5 with 85mm f/1.2 | golden hour backlight lighting | f/2.4 1/1500s ISO 300 | Technical excellence award composition", "editorial cover photo of Siamese cat grooming | Sony u03b11 with 24-70mm f/2.8 | softbox Rembrandt lighting | f/2.9 1/2000s ISO 400 | Technical excellence award composition", "cinematic still photo of Bengal cat playing | Nikon Z9 with 100mm Macro f/2.8 | dappled forest lighting | f/3.4 1/2500s ISO 500 | Technical excellence award composition", "fine art gallery photo of Sphynx cat sleeping | Phase One XT with 400mm f/2.8 | blue hour ambient lighting | f/3.9 1/3000s ISO 600 | Technical excellence award composition", "commercial product photo of Ragdoll cat stretching | Hasselblad X2D with 50mm f/1.0 | studio butterfly lighting | f/4.4 1/3500s ISO 100 | Technical excellence award composition"]' -q gpu.q -l NVIDIA_GPUS=1 python3 /home/nvidia/genai1/run.py
start_time 2025-01-26 13:13:06.172316
end_time 2025-01-26 13:15:19.362819
granted_pe NONE
slots 1
failed 0
exit_status 0
ru_wallclock 133
ru_utime 135.470
ru_stime 1.319
ru_maxrss 7142080
ru_ixrss 0
ru_ismrss 0
ru_idrss 0
ru_isrss 0
ru_minflt 74728
ru_majflt 0
ru_nswap 0
ru_inblock 0
ru_oublock 14208
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 2488
ru_nivcsw 846
wallclock 134.238
cpu 136.789
mem 6587.407
io 0.096
iow 0.000
maxvmem 61648994304
maxrss 7112032256
arid undefined
nvidia_energy_consumed 26058.000
nvidia_power_usage_avg 158.000
nvidia_power_usage_max 158.000
nvidia_power_usage_min 0.000
nvidia_max_gpu_memory_used 0.000
nvidia_sm_clock_avg 1980.000
nvidia_sm_clock_max 1980.000
nvidia_sm_clock_min 1980.000
nvidia_mem_clock_avg 2619.000
nvidia_mem_clock_max 2619.000
nvidia_mem_clock_min 2619.000
nvidia_sm_utilization_avg 0.000
nvidia_sm_utilization_max 0.000
nvidia_sm_utilization_min 0.000
nvidia_mem_utilization_avg 0.000
nvidia_mem_utilization_max 0.000
nvidia_mem_utilization_min 0.000
nvidia_pcie_rx_bandwidth_avg 0.000
nvidia_pcie_rx_bandwidth_max 0.000
nvidia_pcie_rx_bandwidth_min 0.000
nvidia_pcie_tx_bandwidth_avg 0.000
nvidia_pcie_tx_bandwidth_max 0.000
nvidia_pcie_tx_bandwidth_min 0.000
nvidia_single_bit_ecc_count 0.000
nvidia_double_bit_ecc_count 0.000
nvidia_pcie_replay_warning_count 0.000
nvidia_critical_xid_errors 0.000
nvidia_slowdown_thermal_count 0.000
nvidia_slowdown_power_count 0.000
nvidia_slowdown_sync_boost 0.000
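Since qacct reports these GPU metrics as plain key/value lines, they are easy to aggregate. Below is a small illustrative sketch (not a product feature) that sums nvidia_energy_consumed over a list of job IDs by parsing qacct -j output in the format shown above.
import subprocess

def total_gpu_energy(job_ids):
    # Sum the nvidia_energy_consumed values reported by qacct for the given jobs.
    total = 0.0
    for job_id in job_ids:
        out = subprocess.run(["qacct", "-j", str(job_id)],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            parts = line.split()
            if len(parts) == 2 and parts[0] == "nvidia_energy_consumed":
                total += float(parts[1])
    return total

print(total_gpu_energy([2900]))  # energy consumed by the example job above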
This example demonstrates how easily you can utilize the Gridware Cluster Scheduler to keep your GPUs engaged continuously, regardless of the frameworks, models, or input data your cluster users employ for single-GPU jobs, multi-GPU jobs, or multi-node multi-GPU jobs using MPI, whether in containers or with applications available directly on the command line.
Below you can find some output examples...enjoy :)
HPC Gridware Unveils Gridware Cluster Scheduler 9.0.2, Adding NVIDIA Grace Hopper Support (2025-01-23)
A Leap Forward for AI and HPC Workloads
Gridware has unveiled the latest iteration of its Gridware Cluster Scheduler (GCS) 9.0.2, now featuring native support for NVIDIA’s Grace Hopper Superchip. This update, highlighted by HPCwire, marks a significant stride in optimizing high-performance computing (HPC) and AI infrastructure.
Gridware Cluster Scheduler Upgrades: Faster, Smarter, and Ready for AI Workloads (2025-01-20)
If you’re managing HPC or AI clusters with Grid Engine, scalability is probably your daily obsession. As core counts explode and workloads grow more complex, the Gridware Cluster Scheduler (GCS) just leveled up to keep pace—and here’s why it matters to you.
What’s Changed?
We’ve rebuilt core parts of GCS to tackle bottlenecks head-on. Let’s break down what’s new:
1. Self-Sustaining Data Stores
- Problem: The old monolithic data store couldn’t handle parallel requests efficiently. Think authentication delays or qstat queries clogging the system.
- Fix: We split the data store into smaller, independent components. For example, authentication now runs in its own thread pool, fully parallelized. No more waiting for the main scheduler to free up.
- Result: Need to submit 50% more jobs per second? Done. Query job status (qstat -j), hosts (qhost), or resources (qstat -F) 2.5x faster? Check.
2. Cascaded Thread Pools
- How it Works: Tasks are split into sub-tasks, each handled by dedicated thread pools. Think of it like assembly lines for requests—auth, job queries, node reporting—all running in parallel.
- Why It Matters: Even under heavy load, GCS now processes more requests without choking. We measured 25% faster job runtimes in tests, even with heavier submit rates.
3. No More Session Headaches
- Old Pain: Ever had a job submission finish but qstat not see it immediately? Traditional WLMs make you manage sessions manually.
- New Fix: GCS auto-creates cross-host sessions. Submit a job, query it right after—no extra steps. Consistency without the fuss.
Why AI/ML Clusters Win Here
AI workloads aren’t just about GPUs—they demand massive parallel job submissions, rapid status checks, and resource juggling. These upgrades mean:
- Faster job throughput: Submit more training jobs without queue lag.
- Instant resource visibility: qstat -F or qconf queries won’t slow down your workflow.
- Scalability: Handle thousands of nodes reporting status (thanks, sge_execd!) without bottlenecking the scheduler.
What’s Next?
We’re eyeing predictive resource scheduling (think ML-driven job forecasting) and better GPU/CPU hybrid support. But today’s updates already make GCS a reliable solution for modern clusters.
Try It Yourself
The Open Cluster Scheduler code is on GitHub, with prebuilt packages for:
- Linux: lx-amd64, lx-arm64, lx-riscv64, lx-ppc64le, lx-s390x
- BSD: fbsd-amd64
- Solaris: sol-amd64
- Legacy Linux: ulx-amd64, xlx-amd64
For Grid Engine users, this isn’t just an upgrade—it’s a toolkit built for the scale AI and HPC demand. Test it, push it, and let us know how it runs on your cluster.
Dive deeper into the technical details here.
Questions or feedback? Reach out—we’re all about making Grid Engine work harder for you.
Podcast: Open Cluster Scheduler vs. Gridware Cluster Scheduler (2024-10-21)
I couldn't resist using NotebookLM to create a podcast about our first releases: the Open Cluster Scheduler and the Gridware Cluster Scheduler. NotebookLM is gaining viral attention, thanks to its remarkable capabilities tailored for such tasks.
Creating this podcast was a five-minute task—simply uploading the blog posts about the Open Cluster Scheduler and the Gridware Cluster Scheduler and letting the conversation be generated. Most of the time was spent double-checking the content, and I must admit, it got it right on the first try!
As one of the co-founders of HPC Gridware, I absolutely agree with what the AI is saying about it. Hear for yourself: :-)
Mastering GPUs with Open Cluster Scheduler (2024-07-01)
Mastering GPUs with Open Cluster Scheduler's RSMAP
Check out the full article here.
Unlock the full potential of your GPU resources with Open Cluster Scheduler's latest feature — Resource Map (RSMAP). This powerful and flexible resource type ensures efficient and conflict-free utilization of specific resource instances, such as GPUs.
Why Choose RSMAP?
- Collision-Free Use: Ensures exclusive access to resources, preventing conflicts between jobs.
- Detailed Monitoring & Accounting: Facilitates precise tracking and reporting of actual resource usage.
- Versatile Resource Management:
- Host-Level Resources: Manage local resources such as port numbers, GPUs, NUMA resources, network devices, and storage devices.
- Global Resources: Manage network-wide resources like IP addresses, DNS names, license servers, and port numbers.
Example: Efficient GPU Management
1. Define Your GPU Resource: Begin by opening the resource complex configuration using qconf -mc and add the following line:
GPU gpu RSMAP <= YES YES NONE 0
This defines a resource named GPU using the RSMAP type, marking it as requestable and consumable with specific constraints.
2. Initialize Resource on Hosts: Assign values to the GPU resources on a specific host by modifying the host configuration with qconf -me <hostname>. For a host with 4 GPUs:
complex_values GPU=4(0 1 2 3)
This indicates the host has 4 GPU instances with IDs 0, 1, 2, and 3.
3. Submit Your Job: Request GPU resources in your job script:
#!/bin/bash
env | grep SGE_HGR
Submit the job with the command:
qsub -l GPU=2 ./job.sh
Your job will now be allocated the requested GPU resources, which can be confirmed by checking the output for granted GPU IDs. Convert these IDs for use with NVIDIA jobs:
export CUDA_VISIBLE_DEVICES=$(echo $SGE_HGR_GPU | tr ' ' ',')
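For Python-based jobs, the same mapping can be done inside the script itself. Here is a short sketch under the same assumption as the shell one-liner above, namely that the granted IDs are delivered space-separated in SGE_HGR_GPU.
import os

# Translate the granted RSMAP GPU IDs (e.g. "0 3") into CUDA_VISIBLE_DEVICES
# before any CUDA library is initialized.
granted = os.environ.get("SGE_HGR_GPU", "")
if granted:
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(granted.split())
print("Using GPUs:", os.environ.get("CUDA_VISIBLE_DEVICES", "none granted"))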
This innovative approach to resource management enhances both performance and resource tracking, making it a must-have for efficient computing. Plus, HPC Gridware is set to release a new GPU package featuring streamlined configuration, improved GPU accounting, and automated environment variable management, taking the hassle out of GPU cluster management.
For more detailed information, check out the full article here. It's your go-to guide for mastering GPU management with the Open Cluster Scheduler!
Open Cluster Scheduler: The Future of Open Source Workload Management (2024-06-10)
See also our announcement at HPC Gridware
Dear Community,
We are thrilled to announce that the source code repository for the Open Cluster Scheduler is now officially open-sourced and available at github.com/hpc-gridware/clusterscheduler.
The Open Cluster Scheduler is the cutting-edge successor to renowned open-source workload management systems such as "Sun Grid Engine", "Univa Grid Engine Open Core", "Son of Grid Engine," and others. With a development history spanning over three decades, its origins can be traced back to the Distributed Queueing System (DQS), and it achieved widespread adoption under the name "Sun Grid Engine".
A Solution for the AI Era
As the world pivots towards artificial intelligence and high-performance computing, the necessity for an efficient and open-source cluster scheduler has never been more urgent. In today's GPU cluster environments, harnessing full hardware utilization is not only economically beneficial but also accelerates results, enables more inference tasks per hour, and facilitates the creation of more intricate AI models.
Why Open Cluster Scheduler?
There is a real gap in the market for open-source workload managers, and Open Cluster Scheduler is here to fill it with a whole host of remarkable features:
- Dynamic, On-Demand Cluster Configuration: Make changes without the need to restart services or daemons.
- Standard-Compliant Interfaces and APIs: Enjoy compatibility with standard command-line interfaces (qsub, qstat, …) and standard APIs like DRMAA.
- High Throughput: Efficiently handle millions of independent compute jobs daily.
- Mixed Job Support: Run large MPI jobs alongside small, short single-node tasks seamlessly without altering configurations.
- Rapid Submission: Submit thousands of different jobs within seconds.
- High Availability: Ensure reliability and continuous operation.
Optimized for Performance
Open Cluster Scheduler is meticulously optimized across all dimensions:
- Binary Protocol Between Daemons: Enhances communication efficiency.
- Multi-threaded Scheduler: Ensures optimal performance.
- Written in C++/C: Delivers robust and high-speed computing.
- Multi-OS and Architecture Support: Compatible with architectures including AMD64, ARM64, RISC-V, and more.
Looking Forward
We are committed to evolving Open Cluster Scheduler into a modern solution that will be capable of managing highly demanding compute workloads across diverse computational environments, whether on-premises or in the cloud.
We invite you to explore, contribute, and join us in this exciting new chapter. Together, we can shape the future of high-performance computing.
Visit our repository: github.com/hpc-gridware/clusterscheduler
Thank you for your continued support and enthusiasm.
Sincerely,
Daniel, Ernst, Joachim
Enhancing wfl with Large Language Models: Researching the Power of GPT for (HPC/AI) Job Workflows (2023-05-14)
In today's world of ever-evolving technology, the need for efficient and intelligent job workflows is more important than ever. With the advent of large language models (LLMs) like GPT, we can now leverage the power of AI to create powerful and sophisticated job workflows. In this blog post, I'll explore how I've enhanced wfl, a versatile workflow library for Go, by integrating LLMs like OpenAI's GPT. I'll dive into three exciting use cases: job error analysis, job output transformation, and job template generation.
wfl: A Brief Overview
wfl is a flexible workflow Go library designed to simplify the process of creating and managing job workflows. It's built on top of the DRMAA2 standard and supports various backends like Docker, Kubernetes, Google Batch, and more. With wfl, users can create complex workflows with ease, focusing on the tasks at hand rather than the intricacies of job management.
Enhancing the go wfl library with LLMs
I've started to enhance wfl by integrating large language models (OpenAI), enabling users to harness the power of AI to enhance their job workflows even further. By utilizing GPT's natural language understanding capabilities, we can now create more intelligent and adaptable workflows that can be tailored to specific requirements and challenges. This not only expands the possibilities for research but also increases the efficiency of job workflows. These workflows can span various domains, including AI workflows and HPC workflows.
It's important to note that this is a first research step in applying LLMs to wfl, and I expect to find new and exciting possibilities built upon these three basic use cases.
1. Job Error Analysis
Errors are inevitable in any job workflow, but understanding and resolving them can be a time-consuming and tedious process. With the integration of LLMs in wfl, we can now analyze job errors more efficiently and intelligently. By applying a prompt to an error message, the LLM can provide a detailed explanation of the error and even suggest possible solutions. This can significantly reduce the time spent on debugging and increase overall productivity.
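wfl itself is a Go library, so the snippet below is not its API; it is just a minimal Python sketch of the underlying idea, sending a job's error message to an LLM with an explanatory prompt (using the OpenAI Python client; the model name is an arbitrary choice). The same prompt-over-text pattern applies to the output transformation use case described next.
from openai import OpenAI

def explain_job_error(error_message: str) -> str:
    # Ask the model to explain a failed job's error output and suggest fixes.
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # arbitrary chat-capable model for illustration
        messages=[
            {"role": "system", "content": "You are an HPC support engineer."},
            {"role": "user", "content": f"Explain this job error and suggest fixes:\n{error_message}"},
        ],
    )
    return response.choices[0].message.content

print(explain_job_error("error: job killed: exceeded memory limit (h_vmem)"))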
2. Job Output Transformation
Sometimes, the raw output of a job can be difficult to understand or may require further processing to extract valuable insights. With LLMs, we can now apply a prompt to the output of a job, transforming it into a more understandable or usable format. For example, we can use a prompt to translate the output into a different language, summarize it, or extract specific information. This can save time and effort while enabling you to extract maximum value from your job outputs.
3. Job Template Generation
Creating job templates can be a complex and time-consuming process, especially when dealing with intricate workflows. With the integration of LLMs in wfl, we can now also generate job templates based on textual descriptions, making the process more intuitive and efficient. By providing a prompt describing the desired job, the LLM can generate a suitable job template that can be analyzed, customized, and executed. This not only simplifies the job creation process but also enables users to explore new possibilities and ideas more quickly. Please use this with caution and do not execute generated job templates without additional security verification! Automating such verification could be a whole new research area.
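Again as a language-agnostic illustration rather than wfl's actual Go API, a template-generation round trip could look like the Python sketch below: describe the job in plain text, get a structured template back, and print it for human review instead of executing it directly.
from openai import OpenAI

def draft_job_template(description: str) -> str:
    # Ask the model for a JSON job template (command, args, resources) from a text description.
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # arbitrary chat-capable model for illustration
        messages=[
            {"role": "user", "content": "Return only a JSON job template with the fields "
                                        f"remote_command, args and resource requests for: {description}"},
        ],
    )
    return response.choices[0].message.content

# Review the generated template manually; never execute it unverified.
print(draft_job_template("train a small PyTorch model on one GPU for 2 hours"))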
Conclusion
The integration of large language models like GPT into wfl has opened up a world of possibilities for job workflows in HPC, AI, and enterprise settings. By leveraging the power of AI, you can now create more intelligent and adaptable workflows that address specific challenges and requirements. Further use cases, like building whole job flows upon these building blocks, still need to be investigated.
To learn more about wfl and how to harness the power of LLMs for your job workflows, visit the WFL GitHub repository: https://github.com/dgruber/wfl/
A basic sample application demonstrating the Go interface is here: https://github.com/dgruber/wfl/tree/master/examples/llm_openai
Streamline Your Machine Learning Workflows with the wfl Go Library (2023-04-10)
wfl is a versatile and user-friendly Go library designed to simplify the management and execution of workflows. In this blog post, we will explore how wfl can be employed for various machine learning tasks, including sentiment analysis, artificial neural networks (ANNs), hyperparameter tuning, and convolutional neural networks (CNNs).
Sentiment Analysis with wfl
Sentiment analysis is the process of determining the sentiment or emotion expressed in a piece of text. One straightforward approach for sentiment analysis is to use textblob, like in wfl's sentiment analysis example. This example demonstrates how to use a simple sentiment analysis model to classify marketing phrases as positive or negative.
Sentiment Analysis with ANNs and wfl
To enhance the performance of sentiment analysis, we can use an artificial neural network (ANN) trained on a dataset of labeled marketing phrases. The sentiment analysis with Keras example demonstrates how to implement, train, and deploy an ANN using the Keras library and wfl. This example shows how to use the wfl library to manage the training workflow and execute the ANN model for sentiment analysis.
Hyperparameter Tuning with wfl
Hyperparameter tuning is the process of finding the best set of hyperparameters for a machine learning model. wfl's hyperparameter tuning example demonstrates how to perform hyperparameter tuning using the Keras library and wfl. This example shows how to use wfl to manage and execute a grid search to find the optimal hyperparameters, such as learning rate, batch size, and epochs, for a deep learning model.
Hyperparameter Tuning with Google Batch
As hyperparameter tuning can be computationally expensive, it can be beneficial to distribute the workload across multiple machines. wfl's hyperparameter tuning with Google Batch example demonstrates how to use the Google Batch implementation of the DRMAA2 interface to distribute the hyperparameter tuning workload on Google Cloud, significantly accelerating the process and reducing the computational burden on your local machine.
Convolutional Neural Networks with Cifar10 and wfl
Convolutional neural networks (CNNs) are a type of deep learning model particularly suited for image classification tasks. The CNN with Cifar10 example demonstrates how to use wfl to manage the training workflow of a CNN using the Cifar10 dataset. This example shows how to use wfl to train a CNN on Google Cloud and store the trained model in a Google Cloud Storage bucket.
In conclusion, wfl is a handy tool for streamlining your machine learning workflows, from simple sentiment analysis to intricate CNNs. It offers an easy-to-use interface for managing and executing machine learning tasks, and its integration with cloud platforms like Google Cloud enables you to scale your workloads effortlessly. Give wfl a try and see how it can enhance your machine learning projects! Any feedback, especially about things that don't work or are hard to figure out, is welcome. Please open an issue on GitHub.
Creating Kubernetes based UberCloud HPC Application Clusters using Containers (2020-12-16)
This article was originally published at UberCloud's Blog
UberCloud provides all necessary automation for integrating cloud-based self-service HPC application portals in enterprise environments. Due to the significant differences within the IT landscapes of large organizations, we are continuously challenged to provide the necessary flexibility within our own solution stack. Hence, we continuously evaluate newly adopted tools and technologies for their readiness to interact with UberCloud’s technology.
Recent adoption of Kubernetes, not just for enterprise workloads but for all sorts of applications, be it on the edge, AI, or HPC, is a strong focus. We created hundreds of Kubernetes clusters on various cloud providers hosting HPC applications like Ansys, Comsol, OpenFoam, and many more. We can deploy fully configured HPC clusters, dedicated to a single engineer, on GKE or AKS within minutes. We can also use EKS, but the deployment time of an EKS cluster is at this point significantly slower than on the other platforms (around 3x). While GKE is excellent and has been my favorite service (due to its deployment speed and its good APIs), AKS has become really strong in recent months. Many features that are relevant for us (like using spot instances and placement groups) have been implemented on Azure, and AKS cluster allocation time has improved (now even almost one minute faster than GKE: 3:30 min from zero to a fully configured AKS cluster). Great!
When managing HPC applications in dedicated Kubernetes clusters, one challenge remains: how to manage fleets of clusters distributed across multiple clouds? At UberCloud we are building simple tools which take HPC application start requests and turn them into fully automated cluster creation and configuration jobs. One very popular way is to put this logic behind self-service portals where the user selects an application he/she wants to use. Another way is creating those HPC applications based on events in workflows, CI/CD, and GitOps pipelines. Use cases are automated application testing, running automated compute tasks, cloud bursting, infrastructure-as-code integrations, and more. To support those tasks we’ve developed a container which turns an application and infrastructure description into a managed Kubernetes cluster, independent of where the job runs and on which cloud provider and region the cluster is created.
Due to the flexibility of containers, UberCloud’s cluster creation container can be used in almost all modern environments which support containers. We are using it as a Kubernetes job and in CI/CD tasks. When the job is finished, the engineer has access to a fully configured HPC desktop with an attached HPC cluster.
Another integration we just tested is Argo. Argo is a popular workflow engine targeted at and running on top of Kubernetes. We have a test installation running on GKE. As the UberCloud HPC cluster creation is fully wrapped inside a container running a single binary, the configuration required to integrate it into an Argo workflow is very minimal.
After the workflow (task) is finished, the engineer automatically gets access to the freshly created remote visualization application running on a newly allocated AKS cluster spanning two node pools, with GUI-based remote Linux desktop access set up.
The overall AKS cluster creation, configuration, and deployment of our services and HPC application containers took just a couple of minutes. This is a solution targeted at IT organizations challenged by the task of rolling out HPC applications for their engineers while being required to work with modern cloud-based technologies.
Hybrid Cloud interactive HPC Applications on Kubernetes (2020-03-19)
We just published a follow-up article about our experiences at UberCloud running HPC engineering applications on different clouds and on-premises, using Kubernetes as the middleware stack.
Paper about Virtual Grid Engine (2019-12-12)
Over in Japan, researchers working on the K supercomputer recently published a paper about a middleware software which they call VGE - Virtual Grid Engine. It allows them to run bioinformatics software in Grid Engine style on K. Certainly a great read!
Kubernetes, Containers, and HPC at UberCloud (2019-09-20)
Long time no updates on my blog. Need to change that…
@UberCloud we published a whitepaper about our experiences running HPC workloads on Kubernetes. You can download it here.
A shortened version appeared today as a lead article at HPCwire.