Mastering GPUs with Open Cluster Scheduler (2024-07-01)
Mastering GPUs with Open Cluster Scheduler's RSMAP
Unlock the full potential of your GPU resources with Open Cluster Scheduler's latest feature — Resource Map (RSMAP). This powerful and flexible resource type ensures efficient and conflict-free utilization of specific resource instances, such as GPUs.
Why Choose RSMAP?
- Collision-Free Use: Ensures exclusive access to resources, preventing conflicts between jobs.
- Detailed Monitoring & Accounting: Facilitates precise tracking and reporting of actual resource usage.
- Versatile Resource Management:
  - Host-Level Resources: Manage local resources such as port numbers, GPUs, NUMA resources, network devices, and storage devices.
  - Global Resources: Manage network-wide resources like IP addresses, DNS names, license servers, and port numbers.
Example: Efficient GPU Management
Define Your GPU Resource: Begin by opening the resource complex configuration using
qconf -mc
and add the following line:
GPU gpu RSMAP <= YES YES NONE 0
This defines a resource named GPU using the RSMAP type, marking it as requestable and consumable with specific constraints.
Initialize Resource on Hosts: Assign values to the GPU resources on a specific host by modifying the host configuration with
qconf -me <hostname>
For a host with 4 GPUs:
complex_values GPU=4(0 1 2 3)
This indicates the host has 4 GPU instances with IDs 0, 1, 2, and 3.
Submit Your Job: Request GPU resources in your job script:
#!/bin/bash
env | grep SGE_HGR
Submit the job with the command:
qsub -l GPU=2 ./job.sh
Your job will now be allocated the requested GPU resources, which can be confirmed by checking the output for granted GPU IDs. Convert these IDs for use with NVIDIA jobs:
export CUDA_VISIBLE_DEVICES=$(echo $SGE_HGR_GPU | tr ' ' ',')
This innovative approach to resource management enhances both performance and resource tracking, making it a must-have for efficient computing. Plus, HPC Gridware is set to release a new GPU package featuring streamlined configuration, improved GPU accounting, and automated environment variable management, taking the hassle out of GPU cluster management.
For more detailed information, check out the full article here. It's your go-to guide for mastering GPU management with the Open Cluster Scheduler!
Open Cluster Scheduler: The Future of Open Source Workload Management (2024-06-10)
See also our announcement at HPC Gridware
Dear Community,
We are thrilled to announce that the source code repository for the Open Cluster Scheduler is now officially open-sourced and available at github.com/hpc-gridware/clusterscheduler.
The Open Cluster Scheduler is the cutting-edge successor to renowned open-source workload management systems such as "Sun Grid Engine", "Univa Grid Engine Open Core", "Son of Grid Engine," and others. With a development history spanning over three decades, its origins can be traced back to the Distributed Queueing System (DQS), and it achieved widespread adoption under the name "Sun Grid Engine".
A Solution for the AI Era
As the world pivots towards artificial intelligence and high-performance computing, the necessity for an efficient and open-source cluster scheduler has never been more urgent. In today's GPU cluster environments, harnessing full hardware utilization is not only economically beneficial but also accelerates results, enables more inference tasks per hour, and facilitates the creation of more intricate AI models.
Why Open Cluster Scheduler?
There is a real gap in the market for open-source workload managers, and Open Cluster Scheduler is here to fill it with a whole host of remarkable features:
- Dynamic, On-Demand Cluster Configuration: Make changes without the need to restart services or daemons.
- Standard-Compliant Interfaces and APIs: Enjoy compatibility with standard command-line interfaces (qsub, qstat, …) and standard APIs like DRMAA.
- High Throughput: Efficiently handle millions of independent compute jobs daily.
- Mixed Job Support: Run large MPI jobs alongside small, short single-node tasks seamlessly without altering configurations.
- Rapid Submission: Submit thousands of different jobs within seconds.
- High Availability: Ensure reliability and continuous operation.
Optimized for Performance
Open Cluster Scheduler is meticulously optimized across all dimensions:
- Binary Protocol Between Daemons: Enhances communication efficiency.
- Multi-threaded Scheduler: Ensures optimal performance.
- Written in C++/C: Delivers robust and high-speed computing.
- Multi-OS and Architecture Support: Compatible with architectures including AMD64, ARM64, RISC-V, and more.
Looking Forward
We are committed to evolving Open Cluster Scheduler into a modern solution that will be capable of managing highly demanding compute workloads across diverse computational environments, whether on-premises or in the cloud.
We invite you to explore, contribute, and join us in this exciting new chapter. Together, we can shape the future of high-performance computing.
Visit our repository: github.com/hpc-gridware/clusterscheduler
Thank you for your continued support and enthusiasm.
Sincerely,
Daniel, Ernst, Joachim
Enhancing wfl with Large Language Models: Researching the Power of GPT for (HPC/AI) Job Workflows (2023-05-14)
In today's world of ever-evolving technology, the need for efficient and intelligent job workflows is more important than ever. With the advent of large language models (LLMs) like GPT, we can now leverage the power of AI to create powerful and sophisticated job workflows. In this blog post, I'll explore how I've enhanced wfl, a versatile workflow library for Go, by integrating LLMs like OpenAI's GPT. I'll dive into three exciting use cases: job error analysis, job output transformation, and job template generation.
wfl: A Brief Overview
wfl is a flexible workflow Go library designed to simplify the process of creating and managing job workflows. It's built on top of the DRMAA2 standard and supports various backends like Docker, Kubernetes, Google Batch, and more. With wfl, users can create complex workflows with ease, focusing on the tasks at hand rather than the intricacies of job management.
Enhancing the Go wfl library with LLMs
I've started to enhance wfl by integrating large language models (OpenAI), enabling users to harness the power of AI to enhance their job workflows even further. By utilizing GPT's natural language understanding capabilities, we can now create more intelligent and adaptable workflows that can be tailored to specific requirements and challenges. This not only expands the possibilities for research but also increases the efficiency of job workflows. These workflows can span various domains, including AI workflows and HPC workflows.
It's important to note that this is a first research step in applying LLMs to wfl, and I expect to find new and exciting possibilities built upon these three basic use cases.
1. Job Error Analysis
Errors are inevitable in any job workflow, but understanding and resolving them can be a time-consuming and tedious process. With the integration of LLMs in wfl, we can now analyze job errors more efficiently and intelligently. By applying a prompt to an error message, the LLM can provide a detailed explanation of the error and even suggest possible solutions. This can significantly reduce the time spent on debugging and increase overall productivity.
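To make this concrete, here is a minimal Go sketch of the pattern: run a job with wfl and, if it fails, ask an LLM to explain the failure. It calls OpenAI through the community go-openai client rather than through wfl's own LLM integration (whose exact API is not shown in this post), so treat the prompt and the wiring as illustrative assumptions.
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/dgruber/wfl"
	openai "github.com/sashabaranov/go-openai"
)

func main() {
	flow := wfl.NewWorkflow(wfl.NewProcessContext())

	// Run a job which fails on purpose.
	job := flow.Run("/bin/sh", "-c", "ls /does-not-exist").Wait()
	if job.Success() {
		return
	}

	// Ask the LLM to explain the failure and suggest a fix (illustrative prompt).
	prompt := fmt.Sprintf("A batch job running 'ls /does-not-exist' failed with exit status %d. Explain the likely cause and suggest a fix.",
		job.ExitStatus())

	client := openai.NewClient(os.Getenv("OPENAI_API_KEY"))
	resp, err := client.CreateChatCompletion(context.Background(), openai.ChatCompletionRequest{
		Model:    openai.GPT3Dot5Turbo,
		Messages: []openai.ChatCompletionMessage{{Role: openai.ChatMessageRoleUser, Content: prompt}},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.Choices[0].Message.Content)
}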
2. Job Output Transformation
Sometimes, the raw output of a job can be difficult to understand or may require further processing to extract valuable insights. With LLMs, we can now apply a prompt to the output of a job, transforming it into a more understandable or usable format. For example, we can use a prompt to translate the output into a different language, summarize it, or extract specific information. This can save time and effort while extracting maximum value from job outputs.
3. Job Template Generation
Creating job templates can be a complex and time-consuming process, especially when dealing with intricate workflows. With the integration of LLMs in wfl, we can now also generate job templates based on textual descriptions, making the process more intuitive and efficient. By providing a prompt describing the desired job, the LLM can generate a suitable job template that can be analyzed, customized, and executed. This not only simplifies the job creation process but also enables users to explore new possibilities and ideas more quickly. Please use this with caution and do not execute generated job templates without additional security verifications! I suspect that automating such verifications could be a whole new research area.
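To sketch the mechanics behind this (leaving the actual LLM call out): if the model is asked to answer with a plain JSON object, its reply can be unmarshalled into a DRMAA2 job template and inspected before anything is executed. The JSON keys and values below are made up for illustration.
package main

import (
	"encoding/json"
	"fmt"

	"github.com/dgruber/drmaa2interface"
)

func main() {
	// Hypothetical LLM reply for a prompt like
	// "Generate a job template which prints the environment in a busybox container".
	llmReply := `{
	  "remoteCommand": "/bin/sh",
	  "args": ["-c", "env"],
	  "jobCategory": "busybox:latest",
	  "outputPath": "/dev/stdout",
	  "errorPath": "/dev/stderr"
	}`

	var jt drmaa2interface.JobTemplate
	if err := json.Unmarshal([]byte(llmReply), &jt); err != nil {
		panic(err)
	}

	// Review (and validate!) the generated template before handing it to a
	// workflow with RunT() - never execute generated templates blindly.
	fmt.Printf("%+v\n", jt)
}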
Conclusion
The integration of large language models like GPT into wfl has opened up a world of possibilities for job workflows in HPC, AI, and enterprise jobs. By leveraging the power of AI, you can now create more intelligent and adaptable workflows that can address specific challenges and requirements. Further use cases, like building whole job flows upon these building blocks, need to be investigated.
To learn more about wfl and how to harness the power of LLMs for your job workflows, visit the WFL GitHub repository: https://github.com/dgruber/wfl/
A basic sample application demonstrating the Go interface is here: https://github.com/dgruber/wfl/tree/master/examples/llm_openai
Streamline Your Machine Learning Workflows with the wfl Go Library (2023-04-10)
wfl is a versatile and user-friendly Go library designed to simplify the management and execution of workflows. In this blog post, we will explore how wfl can be employed for various machine learning tasks, including sentiment analysis, artificial neural networks (ANNs), hyperparameter tuning, and convolutional neural networks (CNNs).
Sentiment Analysis with wfl
Sentiment analysis is the process of determining the sentiment or emotion expressed in a piece of text. One straightforward approach for sentiment analysis is to use textblob, like in wfl's sentiment analysis example. This example demonstrates how to use a simple sentiment analysis model to classify marketing phrases as positive or negative.
Sentiment Analysis with ANNs and wfl
To enhance the performance of sentiment analysis, we can use an artificial neural network (ANN) trained on a dataset of labeled marketing phrases. The sentiment analysis with Keras example demonstrates how to implement, train, and deploy an ANN using the Keras library and wfl. This example shows how to use the wfl library to manage the training workflow and execute the ANN model for sentiment analysis.
Hyperparameter Tuning with wfl
Hyperparameter tuning is the process of finding the best set of hyperparameters for a machine learning model. wfl's hyperparameter tuning example demonstrates how to perform hyperparameter tuning using the Keras library and wfl. This example shows how to use wfl to manage and execute a grid search to find the optimal hyperparameters, such as learning rate, batch size, and epochs, for a deep learning model.
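As a rough sketch of how such a grid search can be driven by wfl (assuming a hypothetical train.py script that accepts --lr and --batch-size flags; the linked example uses Keras in a similar fashion), not the actual example code:
package main

import (
	"fmt"

	"github.com/dgruber/wfl"
)

func main() {
	flow := wfl.NewWorkflow(wfl.NewProcessContext())

	// Submit one training run per hyperparameter combination; Run() does not block.
	var jobs []*wfl.Job
	for _, lr := range []string{"0.1", "0.01", "0.001"} {
		for _, batchSize := range []string{"32", "64"} {
			jobs = append(jobs, flow.Run("python", "train.py", "--lr", lr, "--batch-size", batchSize))
		}
	}

	// Wait for all runs and report which combinations failed.
	for _, job := range jobs {
		job.Wait()
		if !job.Success() {
			fmt.Println("a training run failed with exit status", job.ExitStatus())
		}
	}
}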
Hyperparameter Tuning with Google Batch
As hyperparameter tuning can be computationally expensive, it can be beneficial to distribute the workload across multiple machines. wfl's hyperparameter tuning with Google Batch example demonstrates how to use the Google Batch implementation of the DRMAA2 interface to distribute the hyperparameter tuning workload on Google Cloud, significantly accelerating the process and reducing the computational burden on your local machine.
Convolutional Neural Networks with Cifar10 and wfl
Convolutional neural networks (CNNs) are a type of deep learning model particularly suited for image classification tasks. The CNN with Cifar10 example demonstrates how to use wfl to manage the training workflow of a CNN using the Cifar10 dataset. This example shows how to use wfl to train a CNN on Google Cloud and store the trained model in a Google Cloud Storage bucket.
In conclusion, wfl is a nice tool for streamlining your machine learning workflows, from simple sentiment analysis to intricate CNNs. It offers an easy-to-use interface for managing and executing machine learning tasks, and its integration with cloud platforms like Google Cloud enables you to scale your workloads effortlessly. Give wfl a try and see how it can enhance your machine learning projects! Any feedback, especially about things which don't work or are hard to figure out, is welcome. Please open an issue on GitHub.
Creating Kubernetes based UberCloud HPC Application Clusters using Containers (2020-12-16)
This article was originally published at UberCloud's Blog
UberCloud provides all necessary automation for integrating cloud-based self-service HPC application portals in enterprise environments. Due to the significant differences inside the IT landscape of large organizations, we are continuously challenged to provide the necessary flexibility within our own solution stack. Hence we continuously evaluate newly adopted tools and technologies for their readiness to interact with UberCloud's technology.
Kubernetes has recently been adopted not just for enterprise workloads but for all sorts of applications, be it on the edge, for AI, or for HPC. We created hundreds of Kubernetes clusters on various cloud providers hosting HPC applications like Ansys, Comsol, OpenFOAM, and many more. We can deploy fully configured HPC clusters which are dedicated to an engineer on GKE or AKS within minutes. We can also use EKS, but the deployment time of an EKS cluster is at this point in time significantly slower than on the other platforms (around 3x). While GKE is excellent and has been my favorite service (due to its deployment speed and its good APIs), AKS has become really strong in the last months. Many features which are relevant for us (like using spot instances and placement groups) have been implemented on Azure, and AKS cluster allocation time is now even almost one minute faster than GKE (3:30 min. from 0 to a fully configured AKS cluster). Great!
When managing HPC applications in dedicated Kubernetes clusters one challenge remains: how to manage fleets of clusters distributed across multiple clouds? At UberCloud we are building simple tools which take HPC application start requests and turn them into fully automated cluster creation and configuration jobs. One very popular way is to put this logic behind self-service portals where the user selects an application he/she wants to use. Another way is creating those HPC applications based on events in workflows, CI/CD, and GitOps pipelines. Use cases are automated application testing, running automated compute tasks, cloud bursting, infrastructure-as-code integrations, and more. To support those tasks we've developed a container which turns an application and infrastructure description into a managed Kubernetes cluster, independent of where the job runs and of the cloud provider and region in which the cluster is created.
Due to the flexibility of containers, UberCloud's cluster creation container can be used in almost all modern environments which support containers. We are using it as a Kubernetes job and in CI/CD tasks. When the job is finished, the engineer has access to a fully configured HPC desktop with an HPC cluster attached.
Another integration we just tested is Argo. Argo is a popular workflow engine targeted at and running on top of Kubernetes. We have a test installation running on GKE. As the UberCloud HPC cluster creation is fully wrapped inside a container running a single binary, the configuration required to integrate it into an Argo workflow is minimal.
After the workflow (task) is finished, the engineer automatically gets access to the freshly created remote visualization application running on a newly allocated AKS cluster spanning two node pools, with GUI-based remote Linux desktop access set up.
The overall AKS cluster creation, configuration, and deployment of our services and HPC application containers took just a couple of minutes. This is a solution targeted at IT organizations challenged by the task of rolling out HPC applications for their engineers while being required to work with modern cloud-based technologies.
UberCloud Releases Multi-Cloud, Hybrid-Cloud HPC Application Platform (2020/11/06)
The way enterprises run High Performance Computing (HPC) applications has changed. With Cloud providers offering improved security, better cost/performance, and seemingly endless compute capacity, more enterprises are turning to Cloud for their HPC workloads.
However, many companies are finding that replicating an existing on-premise HPC architecture in the Cloud does not lead to the desired breakthrough improvements. With this in mind, the UberCloud HPC Application Platform has been built for the cloud from day one, resulting in highly increased productivity for HPC engineers, significantly improved IT security, cloud costs and administrative overhead reduced to a minimum, and full control for engineers and corporate IT over their HPC cloud environment. Today, we are announcing UberCloud's next-generation HPC Application Platform.
Building blocks of the UberCloud Platform, including HPC, Cloud, Containers, and Kubernetes, have been previously discussed on HPCwire: Kubernetes, Containers and HPC, and Kubernetes and HPC Applications in Hybrid Cloud Environments.
Key Stakeholders when Driving HPC Cloud Adoption
When we started designing the UberCloud HPC Application platform we recognized that three major stakeholders are crucial for the overall success of a company’s HPC cloud journey: HPC engineers, Enterprise IT, and the HPC IT team.
HPC application engineers are the driving force behind innovation. To excel in (and enjoy) their job they require a frictionless, self-service user portal for allocating computational resources when they are required. They don't necessarily need to understand how compute nodes, GPUs, storage, or fast network interconnects have to be configured. They expect to be able to allocate and shut down fully configured HPC application environments.
Enterprise IT demands the necessary software tools and pre-configured containerized HPC applications for creating fully automated, completely tested environments. These environments must be suited to interact with HPC applications and their special requirements for resources and license servers. The platform needs to be pluggable to modern IT environments and support technologies like CI/CD pipelines and Kubernetes orchestration.
The HPC IT team (often quite independent from Enterprise IT) requires a hybrid cloud strategy for enhancing their existing on-premise HPC infrastructure with cloud resources for bursting and hybrid cloud scenarios. This team demands control on software versions and puts emphasis on the entire engineering lifecycle, from design to manufacturing.
Introducing the UberCloud HPC Application Platform
The UberCloud HPC Application Platform aims at supporting each of the three major key stakeholders during their HPC cloud adoption journey. How is that achieved?
For the HPC application engineers UberCloud provides a self-service HPC user interface where they select their application(s) along with the hardware parameters they need. With a single click the fully automated UberCloud HPC Application Platform allocates the dedicated computing infrastructure, deploys the application, and configures access for the engineer for instant productivity. Similarly, the HPC application infrastructure can be resized at any given point in time to run distributed memory simulations, parameter studies, or a design of experiments. After work is done the application and the simulation platform can be safely shut down.
Enterprise IT operations often have their own way of managing cloud-based resources. Infrastructure as Code, GitOps, and DevOps are some of the paradigms found in those organizations. The UberCloud HPC Application Platform contains a management tool which can be integrated in any kind of automation or CI/CD pipeline tool chains. UberCloud’s application platform management tool takes care of all aspects of managing containerized HPC applications using Kubernetes based container orchestrators like GKE, AKS, and EKS.
HPC IT teams require integration points for allocating cloud resources and distributing HPC jobs between their on-premise HPC clusters and dynamically allocated cloud resources. The UberCloud HPC Cloud Dispatcher provides batch job interfaces for hybrid cloud, cloud bursting, and high-throughput computing. It relies on open standards through the whole application stack to provide stable integration interfaces.
Putting UberCloud’s HPC Application Platform into Practice
Our first customer that enjoyed the benefits of the UberCloud HPC Application Platform is FLSmidth, a Danish multinational engineering company providing the global cement and mineral industries with factories, machinery, services, and know-how. The proof-of-concept implementation at the end of last year has recently been summarized here, and the extended case study (including a description of the hybrid cloud architecture) is freely available by e-mail request.
UberCloud Webinar with Microsoft (2020/10/26)
It is never too late to consider moving HPC workloads to the cloud. Registration for this week's webinar is still open!
Check out Wolfgang's awesome video message about the key values of UberCloud's HPC Application Platform.
https://www.linkedin.com/feed/update/urn:li:activity:6725420905389944832/
Kubernetes Topology Manager vs. HPC Workload Manager (2020-04-15)
Looks like another gap between Kubernetes and traditional HPC schedulers is being closed: Kubernetes 1.18 enables a new feature called the Topology Manager.
That's very interesting for HPC workloads since process placement on NUMA machines has a high impact on compute-, IO-, and memory-intensive applications. The reasons are well known:
- access latency differs - well, it is NUMA (non-uniform memory access)!
- when processes move between cores and chips, caches need to be refilled (getting cold)
- not just memory, also device access (like GPUs, network etc.) is non-uniform
- device access needs to be managed (a process which runs on a core should access a GPU which is close to that core)
- when having a distributed memory job all parts (ranks) should run with the same speed
In order to get the best performance out of NUMA hardware, Sun Grid Engine introduced around eleven years ago a feature which lets processes get pinned to cores. There are many articles in my blog about that. At Univa we heavily improved it and created a new resource type called RSMAP which combines core binding with resource selection, i.e. when requesting a GPU or something else, the process gets automatically bound to cores or NUMA nodes which are associated with the resource (that’s called RSMAP topology masks). Additionally the decision logic was moved away from the execution host (execution daemon) to the central scheduler as it turned out to be necessary for making global decisions for a job and not wasting time and resources for continued rescheduling when local resources don’t fit.
Back to Kubernetes.
The new Kubernetes Topology Manager is a component of the kubelet which also takes care of these NUMA optimizations. The first thing to note is that it is integrated into the kubelet, which runs locally on the target host.
As the Topology Manager provides host-local logic (like Grid Engine did in the beginning) it can: a) make only host-local decisions and b) lead to wasted resources when pods need to be re-scheduled many times to find a host which fits (if there is any). That's also described in the release notes as a known limitation.
How does the Topology Manager work?
The Topology Manager provides the following allocation policies: none, best-effort, restricted, and single-numa-node. These are kubelet flags to be set.
The actual policy which is applied to the pod depends on the kubelet flag but also on the QoS class of the pod itself. The QoS class of the pod depends on the resource settings in the pod description. If the cpu and memory values in the requests and limits sections are equal, then the QoS class is Guaranteed. Other QoS classes are BestEffort and Burstable.
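As an illustration (a sketch using the Kubernetes Go API types, not taken from the Topology Manager documentation): a pod whose containers request exactly as much cpu and memory as their limits ends up in the Guaranteed QoS class, which matters here because, as noted above, the applied policy also depends on the pod's QoS class.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// requests == limits for cpu and memory -> QoS class "Guaranteed"
	guaranteed := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("2"),
		corev1.ResourceMemory: resource.MustParse("2Gi"),
	}
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "numa-sensitive"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "busybox",
				Resources: corev1.ResourceRequirements{
					Requests: guaranteed,
					Limits:   guaranteed,
				},
			}},
		},
	}
	fmt.Printf("%s requests equal limits, so its QoS class will be Guaranteed\n", pod.Name)
}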
The kubelet calls so-called Hint Providers for each container of the pod and then aligns them with the selected policy, for example checking whether the result works with the single-numa-node policy. When set to restricted or single-numa-node it may terminate the pod when no placement is found, so that the pod can be handled by external means (like rescheduling the pod). But I'm wondering if that will work with sidecar-like setups inside the pod.
The sources of the Topology Manager are well arranged here: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/topologymanager
To summarize: it is a bit complicated, since I can't really find the decision logic of the hint providers documented, so the source code will be the source of truth. Also, since it is a host-local decision, it will be a pain for distributed-memory jobs (like MPI jobs). Maybe they will move the code to the scheduler one day, like it was done in Grid Engine. It is really good that Kubernetes now also takes care of these very specific requirements. So, it is still on the right track to catch up with all the features required for traditional HPC applications. Great!
At UberCloud we closely track Kubernetes as a platform for HPC workloads and are really happy to see developments like the Topology Manager.
Hybrid Cloud interactive HPC Applications on Kubernetes (2020-03-19)
We just published a follow-up article about our experiences at UberCloud running HPC engineering applications on different clouds and on-premises, using Kubernetes as the middleware stack.
Paper about Virtual Grid Engine (2019/12/12)
Over in Japan, researchers working on the K supercomputer recently published a paper about a middleware software which they call VGE - Virtual Grid Engine. It allows them to run bioinformatics software in Grid Engine style on K. Certainly a great read!
Kubernetes, Containers, and HPC at UberCloud (2019-09-20)
Long time no updates on my blog. Need to change that…
@UberCloud we published a whitepaper about our experiences running HPC workloads on Kubernetes. You can download it here.
A shortened version appeared today as the lead article at HPCwire.
qsub for Kubernetes - Simplifying Batch Job Submission (2019-02-23)
Many enterprises are adopting cloud application platforms like Pivotal Application Service and container orchestrators like Kubernetes for running their business applications and services. One of the advantages of doing so is having shared resource pools instead of countless application server islands. This not only simplifies infrastructure management and improves security, it also saves much of the costly infrastructure resources. But that need not be the end of the story. Going one step further (or one step back, looking at why resource management systems like Borg or Tupperware have been built) you will certainly find other groups within your company hungry for spare resources in your clusters. Data scientists, engineers, and bioinformaticians need to execute masses of batch jobs in their daily work. So why not allow them to access the spare Kubernetes cluster resources you are providing to your enterprise developers? With the Pivotal Container Service (PKS) control plane, cluster creation and resizing, even on-premises, is just a matter of one API or command-line call. With PKS the right tooling is available for managing a cluster which soaks up your otherwise idle resources.
One barrier for researchers who are used to running their workloads within their HPC environment can be the different interfaces. If you have worked for decades with the same HPC tooling, moving to Kubernetes can be a challenge. For Kubernetes you need to write declarative YAML files describing your jobs, but users might already have complicated, imperative job submission scripts using command line tools like qsub, bsub, or sbatch. So why not have a similar job submission tool for Kubernetes? I started an experiment writing one using the interfaces I've built a while ago (the core is basically just one line of wfl using drmaa2os under the hood). After setting up GKE or a PKS 1.3 Kubernetes cluster, running a batch job is just a matter of one command line call.
$ export QSUB_IMAGE=busybox:latest
$ qsub /bin/sh -c 'for i in `seq 1 100`; do echo $i; sleep 1; done'
Submitted job with ID drmaa2osh9k9v
$ kubectl logs --follow=true jobs/drmaa2osh9k9v
1
2
3
4
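Under the hood the tool is essentially a thin layer over wfl and drmaa2os, as mentioned above. The following is a rough sketch of the idea, not the actual qsub source; the Kubernetes context constructor and the JobTemplate fields follow wfl's conventions but should be treated as assumptions here.
package main

import (
	"fmt"
	"os"

	"github.com/dgruber/drmaa2interface"
	"github.com/dgruber/wfl"
)

func main() {
	// The Kubernetes backend turns each submitted task into a Kubernetes batch job.
	flow := wfl.NewWorkflow(wfl.NewKubernetesContext())

	job := flow.RunT(drmaa2interface.JobTemplate{
		RemoteCommand: "/bin/sh",
		Args:          []string{"-c", "for i in `seq 1 100`; do echo $i; sleep 1; done"},
		JobCategory:   os.Getenv("QSUB_IMAGE"), // container image, e.g. busybox:latest
	})
	fmt.Println("Submitted job with ID", job.JobID())
}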
More details about the job submission tool for Kubernetes can be found here. Note that it is provided AS IS.
One thing to keep in mind is that there is no notable job queueing or job prioritization system built into the Kubernetes scheduler. If you are submitting a batch job and the cluster is fully occupied, the job submission command will block.
Kubernetes allows other schedulers to be hooked in. kube-batch is one of the notable activities here. Poseidon provides another alternative to the default scheduler which claims to be extremely scalable while allowing complex rule constraints. Univa's Command provides an alternative scheduler as well. Note that these schedulers can also be used with qsub by specifying the scheduler at job submission time.
Combining Technical and Enterprise Computing (2019-01-23)
New workload types and management systems pop up once in a while. Blockchain-based workload types started to run on CPUs, then on GPUs, and some of them later even on ASICs. The recent success of AI systems is pushing GPUs, and, in analogy to blockchains, specialized circuits to their limits. The interesting point here is that AI and blockchain-based technologies are of general business interest, i.e. they are the driving forces behind lots of new business models in many industries. The automotive industry, for example, (re-)discovered artificial neural networks for solving many of the tasks required to let cars drive themselves. I was lucky to join, for a short while more than 10 years ago, a research team taking care of image-based environment perception at a famous car maker. From my perspective the recent developments were not so clear back then, even though we already had self-driving cars steered by neural networks at the beginning of the 90s.
The high-performance computing community has a long tradition of managing tons of compute workload, maximizing resource usage, and queueing and prioritizing workload. But business-critical workloads for many enterprises are much different by nature. Stability, interconnectivity, and security are key requirements. Today the boundaries are getting blurred. Hyperscalers had to solve unique problems running huge amounts of enterprise applications. They had to build their own systems which combined traditional batch scheduling with services. Finally Google came up with an open-source system (Kubernetes) to allow companies to build their own workload management systems. Kubernetes solves a lot of core problems around container orchestration, but many things are missing. Pivotal's Container Service enriches Kubernetes when it comes to solving what we at Pivotal call day-2 issues. Kubernetes needs to be created, updated, and maintained. It's not a closed box - PKS gives you choice, opinions, and best practices for running a platform as a product within large organizations.
But back to technical computing. What is clearly missing within Kubernetes are capabilities built into traditional HPC schedulers over decades. Batch jobs are only supported rudimentarily at the moment. There is no queueing, no sophisticated job prioritization, and no first-class support for MPI workloads like we know from the HPC schedulers. Also, the interfaces are completely different, and many products are built on the HPC scheduler interfaces. Kubernetes will not replace traditional HPC schedulers, just as Hadoop's way of doing batch scheduling (what most people today associate with batch scheduling) did not replace classic HPC schedulers; they are still around and will survive for a reason. The cloud may also make queueing appear obsolete, but only in a perfect world where money and resources are unlimited.
What we need in order to tackle large scale technical computing problems is a complete system architecture combining different aspects and different specialized products. There are three areas I want to look at:
- Applications
- Containers
- Batch jobs
Pivotal has the most complete solution when it comes to providing an application runtime which abstracts away from pure container orchestration. The famous cf push command works with all kinds of programming languages, including Java, Java/Spring, Go, Python, Node… it keeps the developer focused on the application and business logic rather than on building and wiring containers. All of that has been completely automated for years by concepts like buildpacks, service discovery, etc. In addition to that we need a container runtime for pre-defined containers. This is what Pivotal's Container Service (PKS) is for. Finally we have the batch job part, which can be your traditional HPC system, be it Univa Grid Engine, Slurm, or HTCondor.
If we draw the full picture of a modern architecture supporting large-scale AI and technical computing workloads, it looks like the following:
Thanks to open interfaces, RESTful APIs, and software defined networking a smooth interaction is possible. The open service broker API is acting as a bridge in many of the components already.
Enough for now, back to my Macap coffee grinder and later to the OOP conference here in Munich.
The Road to Pivotal Container Service - PKS (2019-01-21)
A few days ago Pivotal released version 1.3 of its solution for enterprise ready container orchestration. Unfortunately the days of release parties are gone but let me privately celebrate the advent of PKS 1.3 - right here, right now.
What has happened so far? It all started with a joint project between Google and Pivotal combining two amazing technologies: Google's Kubernetes and Pivotal's BOSH. I'm pretty sure most of the readers know about Kubernetes but are not so familiar with BOSH. BOSH can be described as a life-cycle management tool for distributed systems (note that this is not BOSH). Through deployment manifests and operating system images called stemcells, BOSH deploys and manages distributed systems on top of infrastructure layers like vSphere, GCP, Azure, and AWS. Well, Kubernetes is a distributed system, so why not deploy and manage Kubernetes with BOSH? Project kubo (Kubernetes on BOSH) was born - a joint collaboration between Google and Pivotal. Finally kubo turned into an official Cloud Foundry Foundation project called Cloud Foundry Container Runtime. CFCR in turn is a major building block of PKS.
With the release of PKS 1.3 Pivotal is supporting vSphere, GCP, Azure, and AWS as infrastructure layer. PKS delivers the same interfaces and admin / user experience independent if you need to manage lots of Kubernetes clusters at scale on-premises or in the cloud.
PKS takes care of the life-cycle management of Kubernetes itself. With PKS you can manage fleets of Kubernetes clusters with a very small but highly efficient platform operations team. We at Pivotal strongly believe that it is much better to run lots of small Kubernetes installations rather than one big one. To do so you need the right tooling, but also the right methodology and mindset. Infrastructure as code as well as SRE techniques are mandatory to be effective and scalable. Pivotal supports its customers in that regard by enabling them in our famous Operation Dojos.
PKS as a commercial product is a joint development between Pivotal and VMware. With the acquisition of Heptio (a company founded by two of the original Kubernetes creators), VMware has first-class knowledge about Kubernetes in house. But let's go deeper into the PKS components in order to get a better understanding of what value PKS provides.
As its core component PKS offers a control plane which can be accessed through a simple command line tool called pks. Once PKS is installed through Pivotal's Ops Manager the control plane is ready. In order to create a new Kubernetes cluster you just need to emit one command: pks create-cluster. What happens is that the control plane creates a new BOSH deployment rolling out a Kubernetes cluster on a set of newly allocated VMs or machines. BOSH takes care of keeping all components up and running - if a VM fails it is automatically repaired. Resizing the cluster, i.e. adding or removing worker nodes, is just a matter of one single command. But that's not all: dev and platform ops need logging and monitoring. Logging sinks forward everything which happens on process level to a syslog target. Through a full integration into Wavefront, monitoring dashboards can be made available with just a few settings. PKS and operating system upgrades can be fully automated and executed without application downtime; Pivotal provides pre-configured CI/CD pipelines and a continuous stream of updates through Pivotal Network. Also part of PKS is VMware's enterprise-ready container registry called Harbor, which was recently accepted as an official CNCF project with more than 100 contributors. VMware's software-defined networking NSX-T (also included in PKS) glues everything together seamlessly and very securely and provides scalable routing and load-balancing functionality.
There is much more to say but I will stop here, making myself a coffee with my greatly engineered Bezzera, and celebrate the release of PKS 1.3! :-)
Well done!
Unikernels Still Alive? (2019-01-05)
During the widespread adoption of container technologies, unikernels were seen for a very short while as a potential successor technology. Docker bought Unikernel Systems in January 2016. After that it has been very quiet around the technology itself. There were some efforts to apply it to high-performance computing, but it didn't get popular in the HPC community either. In Japanese compute clusters they experimented with McKernel. In Europe we have an EU-funded project called Mikelangelo.
But there is some potential that unikernels will come back again. With the quick adoption of Firecracker, a virtualization technology that is able to start VMs in hundreds of milliseconds, unikernels built into VMs can potentially be orchestrated very well. Of course this is something which needs to be handcrafted and still requires a lot of effort, but the major building blocks exist.
The well-known Unik project (which originated at Dell / EMC, btw.) has already added support for Firecracker. More information about how it works is described in this blog post.
Pivotal’s Function Service and Go (2018-12-15)
A few days ago Pivotal started granting access to the Pivotal Function Service for early adopters. All that is required for testing PFS is an account at Pivotal Network (which can be set up within minutes) and a request to get early access to the alpha version of PFS.
The PFS installation is quite simple. All that is required is a Kubernetes cluster, which can easily be created by Pivotal's Container Service (PKS); alternatively minikube or GKE can be used.
There are two files to download: the packages and the pfs command line tool (mac, linux, …). Once downloaded and extracted, just follow the documentation which explains each of the necessary steps. Basically you need to push the downloaded images to your container registry and then start the installation inside Kubernetes with pfs system install. It automatically installs Istio and the Knative components on which PFS is built. The open-source project behind PFS, sponsored by Pivotal, is riff.
After installation PFS is ready to use. Three components are important to understand PFS: functions/services, channels for data exchange between the services, and subscriptions for connecting channels with services.
For this early PFS release, functions can be written in Java/Spring or JS, or provided as scripts or executables. I went the extra mile and created functions in Go, which need to be compiled into executables and registered in PFS as Command functions. Not a big deal. Nevertheless, I am waiting for native support… :)
The first step is setting up your git repo for your functions, which in my case is here. Your function can be a shell script converting stdin and writing to stdout, or any other binary doing the same. For Java you implement the Function interface, converting a string into another string in an apply() method. In my Go example application I have a simple wrapper which selects the function to be executed via an environment variable which I provide during service creation. It reads from stdin and writes to stdout; stderr is ignored.
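For illustration, a minimal sketch of such a wrapper (the FUNCTION variable and the reverse/upper functions mirror the description above; the real code in my repo may differ in details):
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

func reverse(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

func main() {
	// The function input arrives on stdin; the result goes to stdout.
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		os.Exit(1) // stderr is ignored anyway
	}
	text := strings.TrimSpace(string(data))

	// FUNCTION is set with --env during "pfs function create".
	switch os.Getenv("FUNCTION") {
	case "reverse":
		fmt.Print(reverse(text))
	case "upper":
		fmt.Print(strings.ToUpper(text))
	default:
		fmt.Print(text)
	}
}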
Now you can register the function, creating the service reverse which calls the binary f that invokes reverse():
pfs function create reverse --git-repo https://github.com/dgruber/f --artifact lxamd64/f
--image $REGISTRY/$REGISTRY_USER/reverse --env "FUNCTION=reverse" --verbose
Afterwards I can register my other function upper the same way:
pfs function create upper --git-repo https://github.com/dgruber/f --artifact lxamd64/f
--image $REGISTRY/$REGISTRY_USER/upper --env "FUNCTION=upper" --verbose
Note that I’ve setup my image registry beforehand (in my case the REGISTRY env variable is set to _gcr.io_).
In order to list the functions we can do:
pfs service list
NAME STATUS
reverse RevisionFailed: Revision "reverse-00001" failed with message: "The target is not receiving traffic.".
upper RevisionFailed: Revision "upper-00001" failed with message: "The target is not receiving traffic.".
pfs service list completed successfully
There is no traffic yet hence nothing is running (scales down to 0).
To test a function we can use the pfs utility with service invoke:
pfs service invoke reverse --text -- -w '\n' -d 'functional test'
curl 35.205.162.82/ -H 'Host: reverse.default.example.com' -H 'Content-Type: text/plain' -w '\n' -d 'functional test'
tset lanoitcnuf
Voila!
The next step is connecting the functions by channels and subscriptions. Channels abstract away from the underlying infrastructure like Kafka or RabbitMQ or the internal test message bus stub:
pfs channel create messages --cluster-bus stub
pfs channel create upperedmessages --cluster-bus stub
pfs subscription create --channel messages --subscriber upper --reply-to upperedmessages
pfs subscription create --channel upperedmessages --subscriber reverse
What we have done now is let the upper function subscribe to the messages channel and send its output to the upperedmessages channel, to which the reverse function is subscribed.
Now we have to send a message to the incoming messages channel. This is typically done by an application (see Knative serving) but can be tested by accessing a shell in a cluster pod.
I’m following Dave’s recommendation to just do that within a container for testing.
$ kubectl run -it --image=tutum/curl client --restart=Never --overrides='{"apiVersion": "v1", "metadata":{"annotations": {"sidecar.istio.io/inject": "true"}}}'
And then inside the container:
root@client:/# curl -v upperedmessages-channel.default.svc.cluster.local -H "Content-Type: text/plain" -d mytext
If no pods are running yet they will be started. Yeah!
kubectl get pods
NAME READY STATUS RESTARTS AGE
client 2/2 Running 0 30m
reverse-00001-deployment-86bf59b96-4pz8l 3/3 Running 0 2m
reverse-00001-k9m2q 0/1 Completed 0 1d
upper-00001-deployment-6974ccc47d-6lgz7 3/3 Running 0 1m
upper-00001-lgtn8 0/1 Completed 0 1d
When having issues with messages or subscriptions, a look into the logs of the stub-clusterbus-dispatcher pod can be very helpful (it runs in the knative-eventing namespace).
That’s it! Happy functioning!
The Singularity is :%s/near/here/g - Using wfl and Go drmaa2 with Singularity (2018-12-04)
The Singularity containerization software has been re-written in Go. That's great news and it immediately caught my interest :-). For a short intro to Singularity there are many sources, including Wikipedia and the official documentation.
A typical problem using Docker in a broader context, i.e. integrating it into workload managers, is that it runs as a daemon. That means requests to start a Docker container end up in processes which are children of the daemon and not of the original process which initiated the container creation request. Workload managers are built around managing and supervising child processes and have reliable mechanisms for supervising processes which detach themselves from the parent process, but these don't work very well with the Docker daemon. That's why most of the serious workload managers do container orchestration using a different mechanism (like using runc or rkt directly, or even handling cgroups and namespaces directly). Supervision includes resource usage measurement and limitation, which sometimes exceeds Docker's native capabilities. Singularity brings back the behavior expected by workload managers. But there are more reasons people use Singularity which I don't cover in this article.
This article is about using Singularity in Go applications using wfl and drmaa2. Since I wanted to give Singularity a try myself, I started to add support for it in the drmaa2 Go binding by implementing a job tracker. It was a very short journey to get the first container running, as most of the necessary functionality was already implemented.
If you are a Go programmer and write Singularity workflows, either for testing, performance measurements, or for running simulations, I would be more than happy to get feedback about future improvements.
DRMAA2 is a standard for job management which includes basic primitives like start, stop, pause, and resume. That's enough to build workflow management applications, or even simpler abstractions like wfl, around it.
In order to make use of the implemented DRMAA2 calls for Singularity in your Go application you need to include it along with the drmaa2interface dependency
"github.com/dgruber/drmaa2interface"
"github.com/dgruber/drmaa2os"
The next step is to create a DRMAA2 job session. As the underlying implementation needs to make the session names and job IDs persistent, you need to provide a path to a file which is used for that purpose.
sm, err := drmaa2os.NewSingularitySessionManager(filepath.Join(os.TempDir(), "jobs.db"))
if err != nil {
panic(err)
}
Note that when Singularity is not installed the call fails.
Creating a job session works as defined in the standard. As it fails when the session already exists (also defined in the standard), we need to do:
js, err := sm.CreateJobSession("jobsession", "")
if err != nil {
js, err = sm.OpenJobSession("jobsession")
if err != nil {
panic(err)
}
}
The JobTemplate is the key for specifying the details of the job and the container.
jt := drmaa2interface.JobTemplate{
RemoteCommand: "/bin/sleep",
Args: []string{"600"},
JobCategory: "shub://GodloveD/lolcow",
OutputPath: "/dev/stdout",
ErrorPath: "/dev/stderr",
}
The RemoteCommand is the application executed within the container and therefore must be accessible inside of the container. Args is the list of arguments for the file. The JobCategory is mandatory and specifies the image to be used. Here we are downloading the image from Singularity Hub (shub). In order to see some output in the shell we need to specify the OutputPath pointing to /dev/stdout. We should do the same for the ErrorPath otherwise we will miss some error messages printed by the applications running within the container.
In order to give Singularity more parameters (which are of course not part of the standard) we can use the extensions:
jt.ExtensionList = map[string]string{
"debug": "true",
"pid": "true",
}
Please check out the drmaa2os Singularity job tracker sources for a complete list of evaluated parameters (it might be the case that some are missing). debug is a global parameter (before exec) and pid is an exec-specific parameter. Since they are boolean parameters, it does not matter if they are set to true or just an empty string. Only when set to false or FALSE will they be ignored.
This JobTemplate will result in a process managed by the Go drmaa2 implementation started with
singularity --debug exec --pid shub://GodloveD/lolcow /bin/sleep 600
RunJob() finally creates the process and returns immediately.
job, err := js.RunJob(jt)
if err != nil {
panic(err)
}
The returned job object can be used to perform actions on the container. It can be suspended:
err = job.Suspend()
and resumed
err = job.Resume()
or killed
err := job.Terminate()
The implementation sends the necessary signals to the process group.
You can wait until it is terminated (with a timeout or drmaa2interface.InfiniteTime).
err = job.WaitTerminated(time.Second * 240)
Don’t forget to close the JobSession at the end and destroy it properly.
js.Close()
sm.DestroyJobSession("jobsession")
While that is easy, it still requires lots of code when writing real applications. That's what we have wfl for. wfl now also includes Singularity support, as it is based on the drmaa2 interfaces.
Since last Sunday wfl can make use of a default JobTemplate, which allows setting a container image and parameters like OutputPath and ErrorPath for each subsequent job run. This further reduces the lines of code to be written by avoiding RunT().
For starting a Singularity container you need a workflow based on the new SingularityContext which accepts a DefaultTemplate.
var template = drmaa2interface.JobTemplate{
JobCategory: "shub://GodloveD/lolcow",
OutputPath: "/dev/stdout",
ErrorPath: "/dev/stderr",
}
flow := wfl.NewWorkflow(wfl.NewSingularityContextByCfg(wfl.SingularityConfig{DefaultTemplate: template}))
Now flow let’s allow you manage your Singularity containers. For running 100 containers executing the sleep binary in parallel and re-executing any failed containers up 3 times you can do like that.
job := flow.Run("/bin/sleep", "15").
Resubmit(99).
Synchronize().
RetryAnyFailed(3)
For more capabilities of wfl, please also check out the other examples at https://github.com/dgruber/wfl.
Of course you can mix backends in your applications, i.e. you can manage OS processes, k8s jobs, Docker containers, and Singularity containers within the same application.
Any feedback welcome.
Creating Processes, Docker Containers, Kubernetes Jobs, and Cloud Foundry Tasks with wfl (2018-08-26)
There are many workload managers running in the data centers of the world. Some of them take care of running weather forecasts, end-of-day reports, reservoir simulations, or simulating functionalities before they are physically built into computer chips. Others are taking care of running business applications and services. Finally we have data-based workload types aggregating data and trying to discover new information through supervised and unsupervised learning methods.
All of them have a need to run jobs. Jobs are one of the oldest concepts in computer science, existing since the 1950s. A job is a unit of work which needs to be executed. That execution needs to be controlled, which is typically called job control. Examples of implementations of job control systems are command-line shells (like bash, zsh, ...), but also job schedulers (like Grid Engine, Slurm, Torque, PBS, LSF), the new generation of container orchestrators (like Kubernetes), and even application platforms (like Cloud Foundry).
All of the mentioned systems have different characteristics regarding if, when, and how jobs are executed. Also, how jobs are described differs from system to system. The most generic systems treat binaries or shell scripts as jobs. Container orchestrators execute applications stored in container images. Application platforms build the executed containers in a transparent, completely automatic, easy to manage, and very secure way.
In large enterprises you will not find a single system handling all kinds of workloads. That is unfortunate, since it is really something which should be achieved. But the world is moving - new workload types arise, some of them are removed, and others will stay. New workload types have different characteristics by nature. Around a decade ago I saw the first appearance of map/reduce workloads with the advent of Hadoop. Data centers tried to figure out how to run this kind of workload side-by-side with their traditional batch jobs.
Later containers became incredibly popular, and the need to manage them at scale was approached by a new class of workload managers called container orchestrators. Meanwhile batch schedulers have also been enhanced to handle containers, and map/reduce systems can run containers too. Container orchestrators are now looking for more: they are moving up the stack where application platforms like Cloud Foundry have learned lessons and provided functionality for years.
But when looking at those systems you will also find lots of similarities. As mentioned in the beginning, they all have the need to run jobs, for different reasons. A job has a start and a defined end. It can fail to be accepted by the system, fail during execution, or succeed. Often jobs need more context, like environment variables to be set, job scheduling hints to be provided, or data to be staged.
Standards
If you are reading this blog you are most likely aware that there are already standards defined for handling all these tasks. Even though they have been around much longer than most of the new systems, they are still applicable to them.
In some free time I adapted the DRMAA2 standard to Go and worked on an implementation for different systems. Currently it is working on processes, Docker, Kubernetes, and Cloud Foundry. The Grid Engine implementation is based on the C library, the others are pure Go.
When working on the implementation I wanted to quickly write test applications, so I've built a much simpler abstraction around it, which I called wfl.
With wfl it is even more fun to write applications which need to start processes, Docker containers, Kubernetes jobs, or Cloud Foundry tasks. You can even mix backend types in the same application or point to different Kubernetes clusters within one application. It is kept as simple as possible but still grants access to the details of a DRMAA2 job when required.
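A minimal sketch of what that looks like with the process backend (the other backends work the same way; only the context constructor and, for containers, a JobCategory pointing to an image change):
package main

import (
	"fmt"

	"github.com/dgruber/wfl"
)

func main() {
	// Swap the context to target another backend without touching the workflow code.
	flow := wfl.NewWorkflow(wfl.NewProcessContext())

	// Run a job, wait for it, and inspect the result.
	job := flow.Run("/bin/sh", "-c", "echo hello from a wfl job").Wait()
	if job.Success() {
		fmt.Println("job finished successfully")
	} else {
		fmt.Println("job failed with exit status", job.ExitStatus())
	}
}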
There are a few examples in the wfl repository demonstrating its usage. A general introduction can be found in the README on GitHub.