Running Simple PyTorch Jobs on Gridware Cluster Scheduler: Generating 1000 Cat Images (2025-01-26)

The most valuable digital assets in human history are undoubtedly cat pictures :). While they once dominated the web, today they have become a focal point for AI-generated imagery. So why not demonstrate the Gridware Cluster Scheduler by generating 1000 high-resolution cat images?

Key components include:

  • Gridware Cluster Scheduler installation
  • Optional use of Open Cluster Scheduler
  • Python
  • PyTorch
  • Stable Diffusion
  • GPUs

System Setup

After installing the Gridware Cluster Scheduler and enabling GPU support according to the Admin Guide, the NVIDIA_GPUS resource becomes available automatically. The GPU integration sets the NVIDIA-related environment variables for each job according to the devices selected by the scheduler and ensures accurate per-job GPU accounting.

When using the free Open Cluster Scheduler, additional manual configuration is necessary (see this blog post).
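
Before running real workloads, a tiny test job can confirm that the integration hands out GPUs as expected. The sketch below simply prints the GPU-related environment it received; CUDA_VISIBLE_DEVICES is the standard NVIDIA variable the integration is expected to set, and SGE_HGR_NVIDIA_GPUS is the granted-resource variable used later in this post for the Open Cluster Scheduler case.

import os

# Minimal sanity check: print which GPU(s) this job was granted.
for var in ("CUDA_VISIBLE_DEVICES", "SGE_HGR_NVIDIA_GPUS"):
    print(f"{var}={os.getenv(var, '<not set>')}")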

In this example, I assume PyTorch and the required Python libraries are available on the compute nodes. To minimize machine dependencies, containers can be used, although they are not the focus of this post. HPC-compatible container runtimes such as Apptainer (formerly Singularity) and Podman work with the Gridware Cluster Scheduler out of the box.
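
As an illustration only (the image and script paths are placeholders, and the queue and GPU resource request are explained in the next section), a containerized job could be submitted like this, with Apptainer's --nv flag passing the GPUs through to the container:

qsub -b y -q gpu.q -l NVIDIA_GPUS=1 apptainer exec --nv /path/to/pytorch.sif python3 /path/to/run.py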

Submitting Cat Image Creation Jobs

To create cat images, we need input prompts, lots of them. They can be stored in an array like this:

prompts = [
    "A playful cat stretches its ears while rubbing against soft fur...",
    "A curious cat leaps gracefully to catch a tiny mouse...",
    # ... (other prompts)
]

To generate many prompts, we can automate their creation:

import random

def generate_pro_prompts():
    cameras = [
        "Canon EOS R5", "Sony α1", "Nikon Z9", "Phase One XT",
        "Hasselblad X2D", "Fujifilm GFX100 II", "Leica M11",
        "Pentax 645Z", "Panasonic S1R", "Olympus OM-1"
    ]

    lenses = [
        "85mm f/1.2", "24-70mm f/2.8", "100mm Macro f/2.8",
        "400mm f/2.8", "50mm f/1.0", "12-24mm f/4",
        "135mm f/1.8", "Tilt-Shift 24mm f/3.5", "8-15mm Fisheye",
        "70-200mm f/2.8"
    ]

    styles = [
        "award-winning wildlife", "editorial cover", "cinematic still",
        "fine art gallery", "commercial product", "documentary",
        "fashion editorial", "scientific macro", "sports action",
        "architectural interior"
    ]

    lighting = [
        "golden hour backlight", "softbox Rembrandt", "dappled forest",
        "blue hour ambient", "studio butterfly", "silhouette contrast",
        "LED ring light", "candlelit warm", "neon urban", "moonlit"
    ]

    cats = [
        "Maine Coon", "Siamese", "Bengal", "Sphynx", "Ragdoll",
        "British Shorthair", "Abyssinian", "Persian", "Scottish Fold",
        "Norwegian Forest"
    ]

    actions = [
        "mid-leap", "grooming", "playing", "sleeping", "stretching",
        "climbing", "hunting", "yawning", "curious gaze", "pouncing"
    ]

    # Generate random combinations
    prompts = []
    for _ in range(1000):
        style = random.choice(styles)
        cat = random.choice(cats)
        action = random.choice(actions)
        camera = random.choice(cameras)
        lens = random.choice(lenses)
        light = random.choice(lighting)
        aperture = round(1.4 + random.random() * 7.6, 1)  # Random f-stop between f/1.4 and f/9
        shutter_denominator = random.randint(1, 4000)  # Random shutter speed between 1s and 1/4000s
        iso = random.choice([100, 200, 400, 800, 1600, 3200, 6400])  # Random ISO

        prompt = (f"{style} photo of {cat} cat {action} | "
                  f"{camera} with {lens} | {light} lighting | "
                  f"f/{aperture} {shutter_speed:.0f}s ISO {iso} | "
                  f"Technical excellence award composition")

        prompts.append(prompt)

    return prompts

Once we have an array of prompts, we can divide them into chunks so that each batch job generates more than one image. This approach can be further optimized later (not part of this article).

The critical aspect here is job submission. The prompts are supplied to the job as environment variables using the -v switch. The -q switch assigns the job to gpu.q, which is assumed to be configured across the GPU nodes. The -l switch requests 1 GPU device per job, ensuring the GPU integration sets the appropriate NVIDIA environment variables so that jobs don't conflict. This is accomplished through the new qgpu utility called in the gpu.q prolog. For the Open Cluster Scheduler, you need to configure this manually: configure the NVIDIA_GPUS resource as an RSMAP with the GPU ID range, and have the job itself convert the SGE_HGR_NVIDIA_GPUS environment variable set for the job into the NVIDIA environment variables (see this blog post).
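
A minimal sketch of that conversion, placed at the very top of the job script and assuming the granted GPU IDs arrive as a space-separated list in SGE_HGR_NVIDIA_GPUS:

import os

# Open Cluster Scheduler only: map the granted RSMAP IDs to CUDA_VISIBLE_DEVICES
# before any CUDA library is initialized.
granted = os.getenv("SGE_HGR_NVIDIA_GPUS", "")
if granted:
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(granted.split())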

The job itself, executed on a compute node, is a python3 script located on shared storage.

Here's the submit.py script:

import argparse
import json
import subprocess

# generate_pro_prompts() is the prompt generator defined above
# (or imported from a shared module on the cluster's shared storage).

def main():
    prompts = generate_pro_prompts()

    parser = argparse.ArgumentParser()
    parser.add_argument('--chunk-size', type=int, required=True,
                       help='Number of prompts per job')
    args = parser.parse_args()

    # Split prompts into chunks
    chunks = [prompts[i:i+args.chunk_size]
             for i in range(0, len(prompts), args.chunk_size)]

    # Submit jobs
    for i, chunk in enumerate(chunks):
        try:
            # Serialize chunk to JSON for safe transmission
            prompts_json = json.dumps(chunk)

            subprocess.run(
                [
                    "qsub", "-j", "y", "-b", "y",
                    "-v", f"INPUT_PROMPT={prompts_json}",
                    "-q", "gpu.q",
                    "-l", "NVIDIA_GPUS=1",
                    "python3", "/home/nvidia/genai1/run.py"
                ],
                check=True
            )
            print(f"Submitted job {i+1} with {len(chunk)} prompts")
        except subprocess.CalledProcessError as e:
            print(f"Failed to submit job {i+1}. Error: {e}")

if __name__ == "__main__":
    main()

The run.py script carries out the compute-intensive work using the cat-related prompts. Note that there is plenty of room for obvious improvements, which are not the focus of this article. The script retrieves the prompts from the INPUT_PROMPT environment variable together with the unique JOB_ID assigned by the Gridware Cluster Scheduler, exactly as in SGE. The images are stored in the job's working directory, which is assumed to be shared. Learn more about the DiffusionPipeline utilizing Stable Diffusion at HuggingFace.

import logging
import os
import json
from diffusers import DiffusionPipeline
import torch
from PIL import Image

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def generate_image(prompt: str, output_path: str):
    try:
        logging.info("Loading model...")
        pipe = DiffusionPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            use_safetensors=True,
            variant="fp16"
        )
        pipe.to("cuda")
    except Exception as e:
        logging.error("Pipeline init error: %s", str(e))
        return

    try:
        logging.info("Generating: %s", prompt)
        image = pipe(prompt=prompt).images[0]
        image.save(output_path)
        logging.info("Saved: %s", output_path)
    except Exception as e:
        logging.error("Generation error: %s", str(e))

if __name__ == "__main__":
    # Get job context
    job_id = os.getenv("JOB_ID", "unknown_job")

    # Parse prompts from JSON
    try:
        prompt_list = json.loads(os.getenv("INPUT_PROMPT", "[]"))
    except json.JSONDecodeError:
        logging.error("Invalid prompt format!")
        prompt_list = []

    # Process all prompts in chunk
    for idx, prompt in enumerate(prompt_list):
        if not prompt.strip():
            continue  # skip empty prompts

        output_path = f"image_{job_id}_{idx}.png"
        generate_image(prompt, output_path)
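
One obvious improvement mentioned above is to load the model once per job instead of once per prompt. A possible sketch of the main loop, reusing the imports and variables from run.py:

# Sketch: initialize the pipeline once and reuse it for every prompt in the chunk.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16"
)
pipe.to("cuda")

for idx, prompt in enumerate(prompt_list):
    if not prompt.strip():
        continue
    image = pipe(prompt=prompt).images[0]
    image.save(f"image_{job_id}_{idx}.png")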

To submit the AI inference jobs to the system, simply execute:

python3 submit.py --chunk-size 5

This results in 200 queued jobs, each capable of creating 5 images.

Supervising AI Job Execution

When correctly configured, the Gridware Cluster Scheduler executes jobs on available GPUs at the appropriate time. You can check the status with qstat or get job-related information with qstat -j <jobid>. After a while, you will have your 1000 cat images. Once completed, you can also view per-job GPU usage in qacct -j <jobid>, with metrics like nvidia_energy_consumed and nvidia_power_usage_avg, as well as the submission command line with the prompts, for example:

qacct -j 2900
==============================================================
qname                              gpu.q
hostname                           XXXXXX.server.com
group                              nvidia
owner                              nvidia
project                            NONE
department                         defaultdepartment
jobname                            python3
jobnumber                          2900
taskid                             undefined
pe_taskid                          NONE
account                            sge
priority                           0
qsub_time                          2025-01-26 11:02:12.650259
submit_cmd_line                    qsub -j y -b y -v 'INPUT_PROMPT=["award-winning wildlife photo of Maine Coon cat mid-leap | Canon EOS R5 with 85mm f/1.2 | golden hour backlight lighting | f/2.4 1/1500s ISO 300 | Technical excellence award composition", "editorial cover photo of Siamese cat grooming | Sony α1 with 24-70mm f/2.8 | softbox Rembrandt lighting | f/2.9 1/2000s ISO 400 | Technical excellence award composition", "cinematic still photo of Bengal cat playing | Nikon Z9 with 100mm Macro f/2.8 | dappled forest lighting | f/3.4 1/2500s ISO 500 | Technical excellence award composition", "fine art gallery photo of Sphynx cat sleeping | Phase One XT with 400mm f/2.8 | blue hour ambient lighting | f/3.9 1/3000s ISO 600 | Technical excellence award composition", "commercial product photo of Ragdoll cat stretching | Hasselblad X2D with 50mm f/1.0 | studio butterfly lighting | f/4.4 1/3500s ISO 100 | Technical excellence award composition"]' -q gpu.q -l NVIDIA_GPUS=1 python3 /home/nvidia/genai1/run.py
start_time                         2025-01-26 13:13:06.172316
end_time                           2025-01-26 13:15:19.362819
granted_pe                         NONE
slots                              1
failed                             0
exit_status                        0
ru_wallclock                       133
ru_utime                           135.470
ru_stime                           1.319
ru_maxrss                          7142080
ru_ixrss                           0
ru_ismrss                          0
ru_idrss                           0
ru_isrss                           0
ru_minflt                          74728
ru_majflt                          0
ru_nswap                           0
ru_inblock                         0
ru_oublock                         14208
ru_msgsnd                          0
ru_msgrcv                          0
ru_nsignals                        0
ru_nvcsw                           2488
ru_nivcsw                          846
wallclock                          134.238
cpu                                136.789
mem                                6587.407
io                                 0.096
iow                                0.000
maxvmem                            61648994304
maxrss                             7112032256
arid                               undefined
nvidia_energy_consumed             26058.000
nvidia_power_usage_avg             158.000
nvidia_power_usage_max             158.000
nvidia_power_usage_min             0.000
nvidia_max_gpu_memory_used         0.000
nvidia_sm_clock_avg                1980.000
nvidia_sm_clock_max                1980.000
nvidia_sm_clock_min                1980.000
nvidia_mem_clock_avg               2619.000
nvidia_mem_clock_max               2619.000
nvidia_mem_clock_min               2619.000
nvidia_sm_utilization_avg          0.000
nvidia_sm_utilization_max          0.000
nvidia_sm_utilization_min          0.000
nvidia_mem_utilization_avg         0.000
nvidia_mem_utilization_max         0.000
nvidia_mem_utilization_min         0.000
nvidia_pcie_rx_bandwidth_avg       0.000
nvidia_pcie_rx_bandwidth_max       0.000
nvidia_pcie_rx_bandwidth_min       0.000
nvidia_pcie_tx_bandwidth_avg       0.000
nvidia_pcie_tx_bandwidth_max       0.000
nvidia_pcie_tx_bandwidth_min       0.000
nvidia_single_bit_ecc_count        0.000
nvidia_double_bit_ecc_count        0.000
nvidia_pcie_replay_warning_count   0.000
nvidia_critical_xid_errors         0.000
nvidia_slowdown_thermal_count      0.000
nvidia_slowdown_power_count        0.000
nvidia_slowdown_sync_boost         0.000
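
Once the queue has drained, a quick way to confirm that all 1000 images were produced is to count the files in the shared working directory (the names follow the image_<jobid>_<idx>.png pattern used in run.py):

ls image_*.png | wc -l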

This example demonstrates how easily you can utilize the Gridware Cluster Scheduler to keep your GPUs engaged continuously, regardless of the frameworks, models, or input data your cluster users employ, and whether they run single-GPU jobs, multi-GPU jobs, or multi-node multi-GPU jobs using MPI, inside containers or simply as applications available on the command line.

Below you can find some output examples...enjoy :)