Debugging in Kratix
As much as we would all like it to be otherwise, rolling out updates to any software can result in some bumps along the way. This applies to updates to Promises too, but Kratix has features to help you identify issues in your Promise spec, your Promise workflows, and the scheduling of documents outputted by your workflows.
In this blog post we'll explore some of the common issues that users experience when configuring Kratix and developing Promises, as well as how Kratix tries to steer you in the right direction when something goes wrong. We'll be exploring:
- Querying Kratix effectively with labels
- Debugging scheduling issues in Kratix
- Getting information from Destination and State Store status updates
- Validating the Kratix Promise spec
Click on "read more" to continue!
The end goal
We'll be working with the Runtime Promise, which deploys a Deployment configured with Nginx. By making updates to the Promise and Kratix resources, we'll highlight some common problems and the breadcrumbs you can follow to solve them.
You can follow the steps in this post and debug in your own environment. If you want to do this, start with the Runtime Promise here.
A central property that someone making a request of this Promise needs to provide is the image for their Deployment. To enhance this Promise, we want to add a new Pipeline step that performs a security scan of the provided image and outputs the result as a HealthRecord in Kratix. To get started, we'll deploy the Promise to our testing environment.
Getting Started
The first thing we want to do is set up a Kratix environment. To get started quickly, we'll deploy Kratix on Kind clusters via some helper scripts in the Kratix repo. If you're playing along, clone the Kratix repository and run:
make quick-start
make prepare-platform-as-destination
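Before moving on, it's worth confirming the platform cluster is healthy. A quick sanity check (this assumes the quick start created a kind-platform context and installed Kratix into the kratix-platform-system namespace, the same namespace referenced by the MinIO endpoint later in this post) is:
kubectl get pods --namespace kratix-platform-system --context kind-platform
kubectl get destinations --context kind-platform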
Debugging Scheduling in Kratix
If you're playing along, you can clone the Runtime Promise by running:
git clone --depth=1 https://github.com/syntasso/kratix-docs.git runtime-promise
cd runtime-promise
git sparse-checkout set assets/runtime-promise --no-cone
Let's install the Promise with:
kubectl apply -f promise.yaml
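Once applied, you can check that the Promise was accepted and that its API is available. One way to do this (the CRD name here is inferred from the Runtime kind and the marketplace.kratix.io API group used in the request below) is:
kubectl get promise runtime --context kind-platform
kubectl get crd runtimes.marketplace.kratix.io --context kind-platform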
The Runtime Promise allows users to deploy an Application Runtime as a service via a Resource Request, where they can edit the lifecycle, image, servicePort and the number of replicas in their Deployment.
The lifecycle field determines which Destination to schedule the workloads to, and this maps to the label environment=${lifecycle} on the Destinations. Whilst working on the Promise, we want to deploy it to a testing Destination, so the request will look as follows:
apiVersion: marketplace.kratix.io/v1alpha1
kind: Runtime
metadata:
  name: example-runtime
  namespace: default
spec:
  lifecycle: testing
  image: syntasso/website
  servicePort: 80
  replicas: 1
Create a file example-runtime.yaml with these contents and apply the request with kubectl apply -f example-runtime.yaml.
We can query for the pods created as part of the workflow using the selectors that Kratix adds to workflow pods by default. This is particularly useful in busy environments with a lot of running pods:
kubectl get pods --selector kratix.io/promise-name=runtime,\
kratix.io/workflow-type=resource,\
kratix.io/workflow-action=configure,\
kratix.io/resource-name=example-runtime
The output should look something like this:
NAME                                                  READY   STATUS      RESTARTS   AGE
kratix-runtime-example-runtime-instance-24bcb-ffh7w   0/1     Completed   0          20m
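If a workflow pod fails, or you just want to see what a step did, its logs are the first place to look. Kratix typically runs each pipeline step as an init container inside the workflow pod, so (depending on your kubectl version) you can capture everything at once with the same selectors, shortened here for brevity:
kubectl logs --selector kratix.io/resource-name=example-runtime --all-containers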
As the workflow has finished running, we can check that the documents were scheduled to the testing Destination. Like workflow pods, Works are created with a set of default labels to make querying for the Works associated with given Promises and Resource Requests easier. Run the following to get the Work associated with the example-runtime resource:
kubectl get work --selector kratix.io/resource-name=example-runtime -o yaml
The status of the Work shows that it has not been scheduled:
Status:
  Conditions:
  - lastTransitionTime: "2025-03-06T07:00:45Z"
    message: 'No Destinations available work WorkloadGroups: [ae2b1fca515949e5d54fb22b8ed95575]'
    reason: UnscheduledWorkloadGroups
    status: "False"
    type: Scheduled
  - lastTransitionTime: "2025-03-06T07:00:45Z"
    message: WorkGroups that have been scheduled are at the correct Destination(s)
    reason: ScheduledToCorrectDestinations
    status: "False"
    type: Misscheduled
What does this mean? Essentially, there were no Destinations matching the label environment=testing in our environment. Let's review the available Destinations and their labels with:
kubectl get destinations --show-labels
This produces:
NAME       READY   LABELS
platform   True    environment=platform
worker-1   True    environment=dev
There is no Destination with the environment=testing label and, as a result, the documents could not be scheduled. Let's create the testing Destination. To do this we will:
- Create a new Cluster
- Create a backing State Store for the cluster
- Create a new Destination
As we are running on Kind, we can create a new cluster by running:
kind create cluster --image kindest/node:v1.31.2 --name worker-2
export WORKER_2="kind-worker-2"
Next, we need to ensure GitOps tooling is available on the new cluster. The Kratix repo can help here again: from the root of the repo, run the following:
./scripts/install-gitops --context ${WORKER_2} --path worker-2
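The install-gitops script sets up Flux on the new cluster. Assuming it uses the conventional flux-system namespace, you can confirm the controllers are running with:
kubectl get pods --namespace flux-system --context ${WORKER_2}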
Our quick start has configured MinIO on the platform cluster, so we can use the MinIO endpoint within the BucketStateStore. Run the following to create the BucketStateStore:
cat <<EOF > testing-bucket.yaml
apiVersion: platform.kratix.io/v1alpha1
kind: BucketStateStore
metadata:
  name: testing
spec:
  authMethod: accessKey
  bucketName: kratix
  endpoint: minio.kratix-platform-system.svc.cluster.local
  insecure: true
  secretRef:
    name: minio
    namespace: default
status: {}
EOF
kubectl apply -f testing-bucket.yaml --context kind-platform
Now we can create the Destination that is backed by this State Store:
cat <<EOF > testing-destination.yaml
apiVersion: platform.kratix.io/v1alpha1
kind: Destination
metadata:
  labels:
    environment: testing
  name: testing
spec:
  cleanup: none
  filepath:
    mode: nestedByMetadata
  stateStoreRef:
    kind: BucketStateStore
    name: testing
status: {}
EOF
kubectl apply -f testing-destination.yaml --context kind-platform
After applying both of these, we should see that we have a new BucketStateStore and Destination. However, when running the following:
kubectl get destinations.platform.kratix.io testing --context kind-platform
we observe that the testing Destination is not Ready:
NAME      READY
testing   False
Similarly, when querying the BucketStateStore with:
kubectl get BucketStateStore testing --context kind-platform
we can also see that the State Store is not Ready:
NAME      READY
testing   False
Why is this the case? Let's kubectl describe the testing Destination. Run:
kubectl describe destination testing --context kind-platform
The status of the Destination includes some conditions which detail why it is not yet ready:
Status:
  Conditions:
    Last Transition Time: 2025-03-05T11:56:05Z
    Message: Unable to write test documents to State Store
    Reason: StateStoreWriteFailed
    Status: False
    Type: Ready
This is reiterated by an event that was fired:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning DestinationNotReady 20m DestinationController Failed to write test documents to Destination "testing": secret "minio" not found in namespace "default"
When creating both Destinations and State Stores, Kratix checks that the defined locations can be written to with the provided credentials before marking them as Ready. We see a similar status and event for the testing BucketStateStore:
Status:
  Conditions:
    Last Transition Time: 2025-03-05T11:30:07Z
    Message: Error initialising writer: secret "minio" not found in namespace "default"
    Reason: ErrorInitialisingWriter
    Status: False
    Type: Ready
  Status: NotReady
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning NotReady 52m BucketStateStoreController BucketStateStore "testing" is not ready: Error initialising writer: secret "minio" not found in namespace "default"
This means that no work can be scheduled to these Destinations until the problems are remedied, so let's fix the issue. Our MinIO credential isn't quite right: we need to edit the testing BucketStateStore to update the name of the secretRef from minio to minio-credentials. Update the BucketStateStore and, in just a few moments, both the State Store and Destination will become Ready.
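One way to make this change, if you'd rather not edit testing-bucket.yaml and re-apply it, is to patch the resource directly:
kubectl patch bucketstatestore testing --context kind-platform \
  --type merge \
  --patch '{"spec":{"secretRef":{"name":"minio-credentials"}}}'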
Status:
  Conditions:
    Last Transition Time: 2025-03-05T12:28:31Z
    Message: State store is ready
    Reason: StateStoreReady
    Status: True
    Type: Ready
  Status: Ready
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning NotReady 59m BucketStateStoreController BucketStateStore "testing" is not ready: Error initialising writer: secret "minio" not found in namespace "default"
Normal Ready 65s BucketStateStoreController BucketStateStore "testing" is ready
Status:
  Conditions:
    Last Transition Time: 2025-03-05T12:28:31Z
    Message: Test documents written to State Store
    Reason: TestDocumentsWritten
    Status: True
    Type: Ready
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning DestinationNotReady 34m DestinationController Failed to write test documents to Destination "testing": secret "minio" not found in namespace "default"
Normal Ready 114s DestinationController Destination "testing" is ready
Now that the Destination is up and healthy, we can see that the Work has been scheduled successfully:
status:
  conditions:
  - lastTransitionTime: "2025-03-06T07:08:03Z"
    message: All WorkloadGroups scheduled to Destination(s)
    reason: ScheduledToDestinations
    status: "True"
    type: Scheduled
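Kratix also records its scheduling decisions in WorkPlacement resources. Listing them on the platform cluster (and describing one for more detail) is another way to confirm where the documents were placed:
kubectl get workplacements --context kind-platform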
And, more importantly, our example-runtime app is up and running. We can visit it at http://example-runtime.default.local.gd:31338
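If you prefer to check from the terminal, a simple curl against the same address should return the Nginx-served site (this assumes the port mapping set up by the quick start, as reflected in the URL above):
curl http://example-runtime.default.local.gd:31338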

Now that it's deployed successfully, we're ready to build on the Runtime Promise and add the security scan as a new step when configuring resource requests.
We can bootstrap this step with the Kratix CLI's add container command. From the root of the Runtime Promise directory, you can run:
kratix add container resource/configure/instance \
--image ghcr.io/syntasso/kratix-docs/trivy-scan:v1.0.0 \
--name security-scan
This command adds a new container, named security-scan and using the image ghcr.io/syntasso/kratix-docs/trivy-scan:v1.0.0, to the existing resource configure workflow.
Your directory structure should now look like this:
├── example-resource.yaml
├── promise.yaml
└── workflows
└── resource
└── configure
└── instance
├── deploy-resources
│ ├── Dockerfile
│ ├── resources
│ │ ├── postgres-request-template.yaml
│ │ ├── redis-request-template.yaml
│ │ └── runtime-request-template.yaml
│ └── scripts
│ └── pipeline.rb
└── security-scan
├── Dockerfile
├── resources
└── scripts
└── pipeline.sh
You'll also see an addition to the promise.yaml, appending the security-scan container to the list of containers in the resource-configure Pipeline:
workflows:
  resource:
    configure:
    - apiVersion: platform.kratix.io/v1alpha1
      kind: Pipeline
      metadata:
        name: resource-configure
      spec:
        containers:
        - image: ghcr.io/syntasso/kratix-docs/runtime-configure-pipeline:v0.1.0
          name: resource-configure
        - image: ghcr.io/syntasso/kratix-docs/trivy-scan:v1.0.0
          name: security-scan
Next, we need to bring this image into existence. Update the new pipeline.sh file in the security-scan directory to look like this:
#!/usr/bin/env sh
set -euxo pipefail

# Read the image to scan from the incoming Resource Request
image="$(yq eval '.spec.image' /kratix/input/object.yaml)"
echo "Scanning ${image}"

if [ "${DEBUG:-false}" = "true" ]; then
  DEBUG_MODE=true
  echo "Running in debug mode"
else
  DEBUG_MODE=false
fi

# Scan the image with Trivy, writing the JSON report to results.json
TRIVY_DEBUG=$DEBUG_MODE trivy image --format=json --output=results.json "${image}"

# Mark the resource as degraded if any vulnerabilities were reported
health_state="healthy"
if [ "$(jq '.[] | select(.Vulnerabilities != null) | length' results.json)" != "" ]; then
  health_state="degraded"
fi

resource_name=$(yq '.metadata.name' /kratix/input/object.yaml)
namespace="default"

# Write a HealthRecord for this resource to the platform Destination
mkdir -p /kratix/output/platform/
cat <<EOF > /kratix/output/platform/health-record.yaml
apiVersion: platform.kratix.io/v1alpha1
kind: HealthRecord
metadata:
  name: runtime-${resource_name}
  namespace: ${namespace}
data:
  promiseRef:
    name: runtime
  resourceRef:
    name: ${resource_name}
    namespace: ${namespace}
  state: ${health_state}
  lastRun: $(date +%s)
  details:
    results: ""
EOF

# Embed the full scan results in the HealthRecord
cat results.json | yq -P > results.yaml
yq e -i '.data.details.results = load("results.yaml")' /kratix/output/platform/health-record.yaml

# Schedule the platform directory to the platform Destination
cat <<EOF > /kratix/metadata/destination-selectors.yaml
- directory: platform
  matchLabels:
    environment: platform
EOF
This script retrieves the image specified in the request, scans it with trivy and outputs a HealthRecord detailing the results.
To use Trivy, we also need to update the generated Dockerfile to install the Trivy CLI. Update your Dockerfile to look like this:
FROM "alpine"
RUN apk update && apk add --no-cache yq curl jq
RUN curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh -s -- -b /usr/local/bin v0.18.3
ADD scripts/pipeline.sh /usr/bin/pipeline.sh
ADD resources resources
RUN chmod +x /usr/bin/pipeline.sh
CMD [ "sh", "-c", "pipeline.sh" ]
ENTRYPOINT []
To ensure the new security-scan image is available on the Kind clusters, we need to build it and load it onto the Kind node. Run the following from the runtime-promise/workflows/resource/configure/instance/security-scan directory:
docker build . --tag ghcr.io/syntasso/kratix-docs/trivy-scan:v1.0.0
kind load docker-image ghcr.io/syntasso/kratix-docs/trivy-scan:v1.0.0 --name platform
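You can also give the pipeline logic a rough smoke test before wiring it into Kratix by running the container locally and mounting the directories Kratix would normally provide. This is only a sketch: it assumes you run it from the root of the Promise directory and that example-resource.yaml contains a valid Runtime request:
mkdir -p /tmp/kratix-test/input /tmp/kratix-test/output /tmp/kratix-test/metadata
cp example-resource.yaml /tmp/kratix-test/input/object.yaml
docker run --rm \
  --env DEBUG=true \
  --volume /tmp/kratix-test/input:/kratix/input \
  --volume /tmp/kratix-test/output:/kratix/output \
  --volume /tmp/kratix-test/metadata:/kratix/metadata \
  ghcr.io/syntasso/kratix-docs/trivy-scan:v1.0.0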
Whilst testing this command, we'll run in debug mode should any issues come up. Our script has already been configured to allow this to be set via an environment variable.
Add the following lines to the Promise spec for the newly introduced container:
- image: ghcr.io/syntasso/kratix-docs/trivy-scan:v1.0.0
  name: security-scan
  envs:
  - name: DEBUG
    value: "true"
We're nearly ready to install the Promise!
Identifying invalid workflows
Before you can install your Promise, Kratix ensures that your Promise has valid Workflow definitions that can be used to generate your workflow pods. Apply your updated Promise with:
kubectl apply -f promise.yaml
You should see a message that includes:
json: unknown field "envs"
We have a slight typo in the Promise spec for the new workflow: envs should be env. Correct this and apply the Promise again.
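For reference, the corrected container entry in the Promise spec should now look like this:
- image: ghcr.io/syntasso/kratix-docs/trivy-scan:v1.0.0
  name: security-scan
  env:
  - name: DEBUG
    value: "true"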
Re-applying the Promise will trigger the Workflows for the example-runtime request, which should now generate a HealthRecord with the security scan results. Run:
kubectl get healthrecord --context kind-platform
and you should see output similar to:
NAME                      STATUS     AGE
runtime-example-runtime   degraded   8m12s
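To see the full scan output embedded in the record, you can pull the whole object as YAML:
kubectl get healthrecord runtime-example-runtime --output yaml --context kind-platform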
Also, as the HealthRecord references the example-runtime request, the request should be updated to reflect the results in the HealthRecord. Run:
kubectl describe runtime example-runtime --context kind-platform
and you should see something similar to the following:
Health Record:
  Details:
    Results:
      ...
  Last Run: 1741248436
  State: degraded
Great! The updated Runtime Promise now runs a security scan as part of provisioning requests and surfaces the results through a HealthRecord.
Overview
We've explored some of the common stumbling blocks that can come up when working with Kratix - issues with scheduling and configuring Destinations - and the Kratix features you can use as debugging tools.
Many of these are new features we've introduced following feedback from our customers, so if there is a gotcha that has caught you out in the past, let us know via GitHub or our Community Slack - we're always happy to hear from users.