Making our controllers more observable is something we have been working on for a while. The first step in the journey was to introduce more descriptive and consistent status and events across all Kratix-owned resources (for example, check the Promise status and events). Together with additional printer columns for the kubectl get command, this represented a great improvement to the user experience of operating a platform that uses Kratix.
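For context, printer columns are declared on the CRD itself; a minimal sketch of what that typically looks like (the JSON paths here are illustrative, not the exact ones Kratix uses):

```yaml
# Illustrative CRD excerpt: extra columns shown by `kubectl get`
versions:
  - name: v1alpha1
    served: true
    storage: true
    additionalPrinterColumns:
      - name: Status
        type: string
        jsonPath: .status.message
      - name: Age
        type: date
        jsonPath: .metadata.creationTimestamp
```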
Side note: I often say destination statuses were the best feature we ever implemented in Kratix. Time has been proving me right.
The next problem we wanted to tackle was logs. Whenever an error occurred, our customers would face a painful journey, sifting through hundreds of lines of logs trying to find the one line that contained the right error. Kratix outputs a range of logs at different levels, all unstructured. We often found ourselves on customer support calls just to point out that the error was right there, on line 5675 of the logs. We needed to do better.
So back in June 2025, the team at Syntasso got together to explore how we could make it easier for customers to discover what went wrong when their platform is not behaving the way they expect. The result of that session was a series of guidelines and best practices for logging that we should follow when building out controllers. In this blog post, we will go through those guidelines, how we implemented them, and the results we have seen so far.
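To make the contrast concrete, here is roughly what the move from unstructured to structured logging looks like (the field names below are hypothetical examples, not Kratix's actual log schema):

```yaml
# Before: unstructured text, hard to filter reliably
# 2025-06-02T10:15:04Z ERROR failed to apply resources for promise postgresql: destination not found

# After: structured JSON (valid YAML), filterable by field
{"ts": "2025-06-02T10:15:04Z", "level": "error", "controller": "promise", "promise": "postgresql", "msg": "failed to apply resources", "error": "destination not found"}
```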
Rolling out changes across a platform fleet is one of the hardest parts of being a platform engineer. Move too fast, and you risk availability. Move too slowly, and the platform becomes a bottleneck.
Imagine you are responsible for maintaining a suite of production services. Some support apps that are more business-critical
than others - maybe they are customer-facing, or back time-sensitive business functions. You receive a notification of a vulnerability that you need
to patch. How do you ensure that upgrades to those business-critical services are successful and won’t cause downtime or outages?
As platforms are so unique to organisations, and the services required by platform users are so vast, Kratix has always been very
flexible when it comes to the design of Promises. Other than the expectation that Promise writers honour the Promise schema, the
scope for how Promises can be designed is vast; containers can be written in any language, workflows can be as segmented as you
would like and workflow actions can be imperative or declarative.
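To illustrate that flexibility, a Promise's resource workflow is declared as a list of pipelines, each running one or more containers of your choosing; a minimal sketch (the Promise name and container image below are hypothetical):

```yaml
apiVersion: platform.kratix.io/v1alpha1
kind: Promise
metadata:
  name: postgresql
spec:
  workflows:
    resource:
      configure:
        - apiVersion: platform.kratix.io/v1alpha1
          kind: Pipeline
          metadata:
            name: instance-configure
          spec:
            containers:
              # containers can be written in any language;
              # the image name here is a placeholder
              - name: generate-manifests
                image: ghcr.io/example/postgresql-configure:v0.1.0
```

Each pipeline container reads the user's request and writes the resulting documents as output, which is what lets workflows be as segmented, and as imperative or declarative, as you like.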
However, there are a number of design practices and approaches to Promise development that can make development and maintenance
easier for Promise developers, and make consuming services via Promises better for users.
In this blog post, we're going to design a Promise with some core fundamentals in mind, paving the way for improved debugging,
reliability, and user clarity.
Defining and building platform APIs and services is no easy task in general. Platform engineers who adopt Kratix and
create bespoke services – which we call “Promises” – often
encounter a significant "wall of bash" when starting their journey towards taming their underlying IaC config, YAML, and API calls.
These SDKs excel at everything from bootstrapping the initial Promise for your organisation's platform to ensuring consistency and testability across all the Promises your platform teams create.
Navigating Kubernetes resources can often feel like searching for a needle in a
haystack, especially when trying to gain a comprehensive overview of everything
deployed within your cluster. While kubectl is a familiar tool, we've heard that
it can be complicated to visualise the entire landscape of Kratix Promises and
Resources.
To address this, we've been developing a powerful new solution: a SKE plugin
for Headlamp, designed to provide unparalleled observability for your Kubernetes
environments.
Earlier this year, we introduced an exciting new capability in Kratix: health checks for resources.
This addition allows platform teams and app developers to easily observe the status of their requested workloads, without needing to switch context and find it in Destinations.
In this blog, we’ll discuss how you can use it to support progressive rollouts when updating Promises.
When a Promise gets updated, say with a new version of a Helm chart, the standard behaviour in Kratix is to reconcile and update all resource requests at once.
That’s fine in simple dev environments, but for complex workloads, upgrading everything at once is risky.
A failed update could disrupt many environments simultaneously, and debugging becomes difficult when failures are scattered.
Platform engineers need a safer approach: progressive rollouts. Instead of deploying changes to the entire fleet at once,
teams can introduce updates gradually to limit the impact, gather early feedback,
and catch potential bugs before releasing broadly.
But for that to work, Kratix needs a way to understand the health of each individual resource during and after an upgrade.
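That health information is reported to Kratix as a HealthRecord resource. A rough sketch of what one might look like is below; the field names are approximate, so check the Kratix reference documentation for the exact schema:

```yaml
# Approximate shape of a HealthRecord; field names may differ
# from the exact Kratix schema
apiVersion: platform.kratix.io/v1alpha1
kind: HealthRecord
metadata:
  name: dev-instance-health
data:
  promiseRef:
    name: postgresql
  resourceRef:
    name: dev-instance
    namespace: default
  state: healthy
```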
The newly released version of the SKE Backstage plugins no longer relies on a Git repository to perform CRUD operations on the Kubernetes cluster. Instead, the plugins use the Kubernetes API to manage resources directly. This gives users the most up-to-date information on their resources, and lets them manage resources created by other means, such as with kubectl.
One important aspect is that you can now use the plugins with OIDC providers, allowing you to have finer control over the authentication and authorisation process.
In this blog, we'll go through the process of setting up your Kubernetes Cluster with Keycloak, and configuring Backstage to use it for authentication. We will then configure the SKE Backstage plugins to use the OIDC token provided by Keycloak.
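As a preview, Backstage's generic OIDC auth provider is typically configured in `app-config.yaml` along these lines; the Keycloak realm URL, client ID, and secret below are placeholders for values from your own Keycloak setup:

```yaml
# Illustrative app-config.yaml excerpt; all values are placeholders
auth:
  environment: development
  providers:
    oidc:
      development:
        metadataUrl: https://keycloak.example.com/realms/kratix/.well-known/openid-configuration
        clientId: backstage
        clientSecret: ${KEYCLOAK_CLIENT_SECRET}
        prompt: auto
```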
As much as we would all like otherwise, rolling out updates to any software can result in some bumps along the way. This applies to updates to Promises too, but Kratix has some features to help identify issues within your Promise spec, your Promise workflows, and the scheduling of documents output by your workflows.
In this blog post we'll explore some of the common issues that users experience when configuring Kratix and developing Promises, as well as how Kratix tries to steer you in the right direction when something goes wrong. We'll be exploring:
Querying Kratix effectively with labels
Debugging scheduling issues in Kratix
Getting information from Destination and State Store status updates
So you read the guide on Compound Promises and tried out the Workshop, and decided that a Compound Promise is the right abstraction to expose in your platform. You are about to start writing it, but you are still wondering how you would actually go about it.
We hear you.
In this blog post, we will build a Compound Promise from scratch. Consider this the ultimate guide on how to build Compound Promises effectively.
You can follow this guide and build the Promise along with us, or you can use it as a reference when building your own Compound Promises. The Promise we will build is available here.
After reading this post you will:
Learn about some basic Kratix concepts
Learn how to write a Compound Promise
By transforming a user's request into a series of sub-requests
By sending those sub-requests to the Platform cluster (and why you need it)
By defining the sub-Promises that the parent Promise depends on
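The first two steps above can be sketched as a sub-request that the parent Promise's configure pipeline emits, to be scheduled back to the Platform cluster; everything below (API group, kind, names) is hypothetical:

```yaml
# Hypothetical sub-request written out by the parent Promise's
# configure pipeline; it requests a resource from a sub-Promise
# that must already be installed on the Platform cluster
apiVersion: marketplace.kratix.io/v1alpha1
kind: PostgreSQL
metadata:
  name: acme-app-db
  namespace: default
spec:
  size: small
```

Routing this document back to the Platform cluster is what makes the sub-request trigger the sub-Promise's own workflows, which is why the Platform cluster must itself be registered as a Destination.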