Kubernetes, Containers, and HPC at UberCloud (2019-09-20)

Long time no updates on my blog. Need to change that…

At @UberCloud we published a whitepaper about our experiences running HPC workloads on Kubernetes. You can download it here.

A shortened version appeared today as the lead article at HPCWire.

qsub for Kubernetes - Simplifying Batch Job Submission (2019-02-23)

Many enterprises are adopting cloud application platforms like Pivotal Application Service and container orchestrators like Kubernetes for running their business applications and services. One of the advantages of doing so is having shared resource pools instead of countless application server islands. This not only simplifies infrastructure management and improves security, it also saves a lot of costly infrastructure resources. But that doesn't have to be the end of the story. Going one step further (or one step back, looking at why resource management systems like Borg or Tupperware were built in the first place) you will certainly find other groups within your company hungry for the spare resources in your clusters. Data scientists, engineers, and bioinformaticians need to execute masses of batch jobs in their daily work. So why not allow them to access the spare resources of the Kubernetes clusters you are providing to your enterprise developers? With the Pivotal Container Service (PKS) control plane, cluster creation and resizing, even on-premises, is just a matter of one API or command line call. With PKS you have the right tooling for managing a cluster that fills up the resources you have available anyhow.

One barrier for researchers who are used to running their workloads in an HPC environment can be the different interfaces. If you have worked for decades with the same HPC tooling, moving to Kubernetes can be a challenge. For Kubernetes you need to write declarative YAML files describing your jobs, but users might already have complicated, imperative job submission scripts built around command line tools like qsub, bsub, or sbatch. So why not have a similar job submission tool for Kubernetes? I started an experiment and wrote one using the interfaces I built a while ago (the core is basically just one line of wfl using drmaa2os under the hood). After setting up a GKE or PKS 1.3 Kubernetes cluster, running a batch job is just a matter of one command line call:

$ export QSUB_IMAGE=busybox:latest
$ qsub /bin/sh -c 'for i in `seq 1 100`; do echo $i; sleep 1; done'
Submitted job with ID drmaa2osh9k9v
$ kubectl logs --follow=true jobs/drmaa2osh9k9v
1
2
3
4

You can find more details about the job submission tool for Kubernetes here. Note that it is provided AS IS.

One thing to keep in mind is that there is no notable job queueing or job prioritization system built into the Kubernetes scheduler. If you submit a batch job while the cluster is fully occupied, the job submission command will block.

Kubernetes allows you to hook in other schedulers. kube-batch is one of the notable activities here. Poseidon provides another alternative to the default scheduler, which claims to be extremely scalable while supporting complex rule constraints. Univa's Command provides an alternative scheduler as well. Note that these schedulers can also be used with qsub by specifying the scheduler at job submission time.
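
For illustration, selecting such an alternative scheduler for a plain Kubernetes batch job typically happens through the schedulerName field in the pod spec. Here is a minimal sketch (job name, image, and scheduler name are placeholders):

$ kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: demo-job
spec:
  template:
    spec:
      # hand this job's pod over to the alternative scheduler
      schedulerName: kube-batch
      containers:
      - name: demo
        image: busybox:latest
        command: ["/bin/sh", "-c", "echo hello from the batch job"]
      restartPolicy: Never
EOF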

Combining Technical and Enterprise Computing (2019-01-23)

New workload types and management systems pop up once in a while. Blockchain-based workloads started to run on CPUs, then on GPUs, some of them later even on ASICs. The recent success of AI systems is pushing GPUs and, in analogy to blockchains, specialized circuits to their limits. The interesting point here is that AI and blockchain-based technologies are of general business interest, i.e. they are driving lots of new business models in many industries. The automotive industry, for example, (re-)discovered artificial neural networks for solving many of the tasks required to let cars drive themselves. I was lucky to join a research team working on image-based environment perception at a famous car maker for a short while more than 10 years ago. From my perspective the recent developments were not so clear back then, even though we already had self-driving cars steered by neural networks at the beginning of the 90s.

The high-performance computing community has a long tradition of managing tons of compute workload, maximizing resource usage, and queueing and prioritizing work. But business-critical workloads in many enterprises are quite different by nature: stability, interconnectivity, and security are key requirements. Today the boundaries are getting blurred. Hyperscalers had to solve unique problems running huge amounts of enterprise applications, and they had to build their own systems which combined traditional batch scheduling with services. Finally Google came up with an open source system (Kubernetes) that allows companies to build their own workload management systems. Kubernetes solves a lot of core problems around container orchestration, but many things are missing. Pivotal's Container Service enriches Kubernetes when it comes to solving what we at Pivotal call day 2 issues: Kubernetes needs to be created, updated, and maintained. It's not a closed box; PKS gives you choice, opinions, and best practices for running a platform as a product within large organizations.

But back to technical computing. What is clearly missing in Kubernetes are capabilities that have been built into traditional HPC schedulers over decades. Batch jobs are only rudimentarily supported at the moment: there is no queueing, no sophisticated job prioritization, and no first-class support for MPI workloads like we know them from HPC schedulers. The interfaces are also completely different, and many products are built on top of the HPC scheduler interfaces. Kubernetes will not replace traditional HPC schedulers, just as Hadoop's way of doing batch scheduling (what most people today associate with batch scheduling) did not replace classic HPC schedulers; they are still around and will survive for a reason. The cloud may also make you think queueing has become obsolete, but only in a perfect world where money and resources are unlimited.

What we need in order to tackle large scale technical computing problems is a complete system architecture combining different aspects and different specialized products. There are three areas I want to look at:

  • Applications
  • Containers
  • Batch jobs

Pivotal has the most complete solution when it comes to providing an application runtime that abstracts away pure container orchestration. The famous cf push command works with all kinds of programming languages, including Java, Java/Spring, Go, Python, Node.js… It keeps developers focused on their application and business logic rather than on building and wiring containers. All of that has been completely automated for years through concepts like buildpacks, service discovery, etc. In addition to that we need a container runtime for pre-defined containers; this is what Pivotal's Container Service (PKS) is for. Finally we have the batch job part, which can be your traditional HPC system, be it Univa Grid Engine, Slurm, or HTCondor.
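
As a small illustration (the app name is made up), deploying an application to the application runtime is a single command issued from the source directory; buildpack detection, container building, and routing happen behind the scenes:

$ cd accounting-service/
$ cf push accounting-service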

If we draw the full picture of a modern architecture supporting large-scale AI and technical computing workloads, it looks like the following:

(Figure: architecture overview combining PKS and HPC)

Thanks to open interfaces, RESTful APIs, and software-defined networking, a smooth interaction is possible. The Open Service Broker API already acts as a bridge between many of the components.

Enough for now, back to my Macap coffee grinder and later to the OOP conference here in Munich.

The Road to Pivotal Container Service - PKS (2019-01-21)

A few days ago Pivotal released version 1.3 of its solution for enterprise-ready container orchestration. Unfortunately the days of release parties are gone, but let me privately celebrate the advent of PKS 1.3, right here, right now.

What has happened so far? It all started with a joint project between Google and Pivotal combining two amazing technologies: Google's Kubernetes and Pivotal's BOSH. I'm pretty sure most readers know about Kubernetes but are not so familiar with BOSH. BOSH can be described as a life-cycle management tool for distributed systems (note that this is not BOSH). Through deployment manifests and operating system images called stemcells, BOSH deploys and manages distributed systems on top of infrastructure layers like vSphere, GCP, Azure, and AWS. Well, Kubernetes is a distributed system, so why not deploy and manage Kubernetes with BOSH? Project kubo (Kubernetes on BOSH) was born, a joint collaboration between Google and Pivotal. Eventually kubo turned into an official Cloud Foundry Foundation project called Cloud Foundry Container Runtime (CFCR). CFCR in turn is a major building block of PKS.
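
To give a rough idea of the workflow (the environment alias, stemcell file, and deployment name below are placeholders), a BOSH deployment boils down to uploading a stemcell plus a manifest and letting the BOSH director converge the system to the desired state:

$ bosh -e my-director upload-stemcell light-bosh-stemcell-ubuntu-xenial.tgz
$ bosh -e my-director -d cfcr deploy cfcr.yml
$ bosh -e my-director -d cfcr instances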

With the release of PKS 1.3, Pivotal supports vSphere, GCP, Azure, and AWS as infrastructure layers. PKS delivers the same interfaces and admin/user experience regardless of whether you need to manage lots of Kubernetes clusters at scale on-premises or in the cloud.

PKS takes care of the life-cycle management of Kubernetes itself. With PKS you can manage fleets of Kubernetes clusters with a very small but highly efficient platform operations team. We at Pivotal strongly believe that it is much better to run lots of small Kubernetes installations rather than one big one. To do so you need the right tooling, but also the right methodology and mindset. Infrastructure as code as well as SRE techniques are mandatory to be effective and scalable. Pivotal supports its customers in that regard by enabling them in our famous Operation Dojos.

PKS as a commercial product is a joint development between Pivotal and VMware. With the acquisition of Heptio (a company founded by two of the original Kubernetes creators), VMware has first-class Kubernetes knowledge in house. But let's dig deeper into the PKS components in order to get a better understanding of the value PKS provides.

As its core component PKS offers a control plane which can be accessed through a simple command line tool called pks. Once PKS is installed through Pivotal's Ops Manager, the control plane is ready. To create a new Kubernetes cluster you just need to issue one command: pks create-cluster. The control plane then creates a new BOSH deployment rolling out a Kubernetes cluster on a set of newly allocated VMs or machines. BOSH takes care of keeping all components up and running; if a VM fails it is automatically repaired. Resizing the cluster, i.e. adding or removing worker nodes, is just a matter of a single command. But that's not all: dev and platform ops need logging and monitoring. Logging sinks forward everything that happens on the process level to a syslog target, and through a full integration with Wavefront, monitoring dashboards can be made available with just a few settings. PKS and operating system upgrades can be fully automated and executed without application downtime; Pivotal provides pre-configured CI/CD pipelines and a continuous stream of updates through Pivotal Network. Also part of PKS is VMware's enterprise-ready container registry Harbor, which was recently accepted as an official CNCF project with more than 100 contributors. VMware's software-defined networking NSX-T (also included in PKS) glues everything together seamlessly and securely and provides scalable routing and load-balancing functionality.
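
As a sketch of how that feels from the command line (cluster name, hostname, and plan are placeholders; check the PKS documentation for the exact flags of your version):

$ pks create-cluster my-cluster --external-hostname my-cluster.example.com --plan small
$ pks resize my-cluster --num-nodes 5
$ pks get-credentials my-cluster
$ kubectl get nodes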

There is much more to say but I will stop here, make myself a coffee with my beautifully engineered Bezzera, and celebrate the release of PKS 1.3! :-)

Well done!

Unikernels Still Alive? (2019-01-05)

During the widespread adoption of container technologies, unikernels were seen for a very short while as a potential successor technology. Docker bought Unikernel Systems in January 2016. After that it became very quiet around the technology itself. There were some efforts to apply it to high performance computing, but it didn't get popular in the HPC community either. Japanese compute clusters experimented with McKernel. In Europe we have an EU-funded project called Mikelangelo.

But there is some potential for unikernels to come back. With the quick adoption of Firecracker, a virtualization technology that is able to start VMs in hundreds of milliseconds, unikernels built into VMs can potentially be orchestrated very well. Of course this is something which needs to be handcrafted and still requires a lot of effort, but the major building blocks exist.

The very well known Unik project (which originated at Dell / EMC, btw.) has already added support for Firecracker. More information about how it works is described in this blog post.

Unikernels and Containers (2016/06/14)

Thanks to Docker (and Go of course) containers are used for a broad range of use cases. They became especially important in the services world and for testing and development, and they are also starting to become interesting for the HPC community, which has specialized frameworks for them.

For about a year now unikernels have been a topic, too. Until now it looks like the basic development of the frameworks is the main focus, especially by companies focusing on IoT. I just stumbled over unik, which allows packing Go applications into a unikernel.

So what are unikernels and how do they differ from containers?

Unikernels are machine images which are based on library operating systems. The typical case today is that a fully fledged operating system runs and hosts many applications at the same time. With containers you still have that main operating system, but through kernel features like namespaces, cgroups, and chroot you can isolate the containers from each other in a way that feels like running in a different OS.

With unikernels you are doing the opposite: instead of sharing the same OS kernel you isolate applications within independent operating systems. Isn't that what VMs are doing? No, since unikernels are created for a particular application and hence only provide the OS functionality required by that application; they also typically run in a single address space. Some advantages are that you have real isolation between them (unlike containers), and there is no specific kernel version or feature required from a shared OS under the hood. Performance is certainly another aspect when it comes to HPC: no context switches between privileged and unprivileged address spaces, no IO, compute cycles, or interrupts wasted on unnecessary OS functionality. Highly specialized OS images can be tuned exactly for one use case, and they are much smaller.

But how do you personalize operating systems without an immense admin overhead?

This is where the library operating system comes into play. The OS images are ultimately just created by frameworks during application development. So what we see is a movement from an admin-controls-everything paradigm on the compute farm (the admin has to install libraries and applications), to DevOps, where developers also have to think about how to ship the application, e.g. by packing it with its dependencies into containers, to developer-packs-app-with-dependencies-into-a-specialized-OS.

Again this goes hand in hand with the cloud computing paradigm, where the infrastructure is just available (and as big and flexible as your pocket) and developers have to deal with it directly. For running unikernels on-premises the hypervisor approach seems to be used, because of the underlying difficulty of having the right device drivers.

The Actor Model Greatly Explained (09/02/2016)

Great video about the actor model.

Avoiding the LD_LIBRARY_PATH for Shared Libs in Go (cgo) Applications (2015-12-21)

That Go (#golang) works very well for accessing C libraries is widely known. The DRMAA and DRMAA2 Go wrappers, for example, access the native C shared libraries provided by the Grid Engine installation. Those libraries are not installed in the standard library paths by default, so a Go application using the Go wrapper can only find the shared lib when LD_LIBRARY_PATH is set to the lib/lx-amd64 path of the Grid Engine installation.

But what if you want to ship an application written in Go which uses a shared library, and you want to ship the shared library along with your application? Or if you want to force the application to use a shared library which is not in a default library path?

If you don't place the shared lib in a system library path and you don't want to use mechanisms like LD_LIBRARY_PATH, then you can think about setting the rpath for your application. This is also possible for Go applications. The rpath just adds a library search path to your binary which can differ from the standard search paths. When you don't know the absolute installation path of the application in advance, which is most likely the common case, you can set the rpath relative to where your application is stored. This can be done by using the $ORIGIN keyword.

The following example demonstrates how this can be done when you compile your Go application.

Let's assume you ship your Go application in a directory structure like this:

myapp/
myapp/bin/myapplication
myapp/lib/mysharedlib.so

The Go binary myapplication needs the C library mysharedlib.so, which is located in the lib directory (../lib relative to the binary). Then you can build it, telling the linker that you want to set the rpath to $ORIGIN/../lib:

export CGO_LDFLAGS="-L/path/to/the/lib -Wl,-rpath -Wl,\$ORIGIN/../lib"
export CGO_CFLAGS="-I/path/to/include/file/of/the/lib/include"
go build

Inspecting the rpath of the executable can be done this way:

objdump -p yourapplication | grep RPATH
  RPATH                $ORIGIN/../lib

Installing Grafana on Raspberry Pi 2 (2015-10-24)

Collecting time series data like stock market or home automation data and storing it in InfluxDB is easy using InfluxDB's new Go client library.
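
As an example (database name, measurement, and tag are made up), writing a data point boils down to a single HTTP request with InfluxDB's line protocol, which is essentially what the Go client library wraps:

$ curl -i -XPOST 'http://localhost:8086/write?db=homedata' \
  --data-binary 'temperature,room=livingroom value=21.5'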

The next step is displaying the data like temperatures as continuously updated graphs.

Grafana is a nice, widely used web interface for displaying data as graphs. It works well with InfluxDB but also with other backends like Elasticsearch.

In a previous post I documented how InfluxDB can be installed on the Raspberry Pi. This post summarizes my experience installing Grafana on the Pi. That tool, too, has been working reliably on the Pi for a while. It comes with a web server, so you can display the graphs on any device.

To install Grafana on the Raspberry Pi 2 it needs to be built from scratch (I couldn't find it in a repository for ARM). It consists of a Go backend and a node.js frontend, hence you need to have the Go compiler, git, and node.js installed.

Building the Go backend:

$ go get github.com/grafana/grafana
$ cd $GOPATH/src/github.com/grafana/grafana
$ go run build.go setup 
$ $GOPATH/bin/godep restore && go run build.go build

Building the frontend:

 $ sudo aptitude install npm
 $ sudo aptitude install node
 $ sudo npm config set registry http://registry.npmjs.org/
 $ sudo npm install tap

Then I found myself in the dependency hell of node.js; the Go part was just so simple. IIRC, to get grunt installed I needed to update npm as well as install n.

 $ sudo npm install npm -g
 $ sudo npm install n -g
 $ sudo npm install grunt -g

Finally the frontend can be built (you need to be in the grafana directory):

$ sudo npm install

At the beginning this resulted in version compatibility issues with some libraries. What helped was re-installing the required versions with sudo npm install tool@version -g and then running npm install again.

Finally run:

$ grunt --force

If all that was successful Grafana can be started:

$ ./bin/grafana-server 

Then you can log in on port 3000, create your user and organization, add your InfluxDB data sources, and finally create your graphs based on the data you are constantly adding to InfluxDB.

Really a cool thing and a pretty useful tool for displaying all kinds of measurements you are doing.