
Debugging in Kratix

· 12 min read
Sapphire Mason-Brown
Engineer @ Syntasso

As much as we might wish otherwise, rolling out updates to any software can come with some bumps along the way. This applies to updates to Promises too, but Kratix has features to help you identify issues within your Promise spec, your Promise workflows, and the scheduling of documents output by your workflows.

In this blog post we'll explore some of the common issues that users experience when configuring Kratix and developing Promises, as well as how Kratix tries to steer you in the right direction when something goes wrong. We'll be exploring:

  • Querying Kratix effectively with labels
  • Debugging scheduling issues in Kratix
  • Getting information from Destination and State Store status updates
  • Validating the Kratix Promise spec


The end goal

We'll be working with the Runtime Promise which deploys a Deployment configured with Nginx. By making updates to the Promise and Kratix resources, we'll highlight some common problems and the breadcrumbs you can follow to solve them.

You can follow the steps in this post and debug in your own environment. If you want to do this, start with the Runtime Promise here.

A key property anyone making a request of this Promise needs to provide is the image for their Deployment. To improve this Promise, we want to add a new Pipeline step that performs a security scan of the provided image and outputs the result as a HealthRecord in Kratix. To get started, we'll deploy the Promise to our testing environment.

Getting Started

The first thing we want to do is set up a Kratix environment. To get started quickly, we'll deploy Kratix on Kind clusters via some helper scripts in the Kratix repo. If you're playing along, clone the Kratix repository and run:

make quick-start
make prepare-platform-as-destination

Debugging Scheduling in Kratix

If you're playing along, you can clone the Runtime Promise by running:

git clone --depth=1 https://github.com/syntasso/kratix-docs.git runtime-promise
cd runtime-promise
git sparse-checkout set assets/runtime-promise --no-cone

Let's install the Promise with:

kubectl apply -f promise.yaml

The Runtime Promise allows users to deploy an Application Runtime as a service via a Resource Request, where they can set the lifecycle, image, servicePort and the number of replicas in their Deployment.

The lifecycle field determines which Destination to schedule the workloads to; it maps to the label environment=${lifecycle} on the Destinations. Whilst working on the Promise, we want to deploy it to a testing Destination, so the request will look as follows:

apiVersion: marketplace.kratix.io/v1alpha1
kind: Runtime
metadata:
  name: example-runtime
  namespace: default
spec:
  lifecycle: testing
  image: syntasso/website
  servicePort: 80
  replicas: 1

Create a file example-runtime.yaml with these contents and apply this request with kubectl apply -f example-runtime.yaml.
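
Under the hood, the Promise's configure workflow tells Kratix where to schedule the generated documents by writing a destination-selectors.yaml file to /kratix/metadata. As a rough sketch (the real deploy-resources script lives in the Promise repo), for this request it would contain something like:

- matchLabels:
    environment: testing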

We can query for the pods created as part of the workflow using the selectors that Kratix adds to workflow pods by default. This is particularly useful in busy environments with a lot of running pods:

kubectl get pods --selector kratix.io/promise-name=runtime,\
kratix.io/workflow-type=resource,\
kratix.io/workflow-action=configure,\
kratix.io/resource-name=example-runtime

The output should look something like this:

NAME                                                  READY   STATUS      RESTARTS   AGE
kratix-runtime-example-runtime-instance-24bcb-ffh7w   0/1     Completed   0          20m
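
If a workflow pod fails, the same selectors save you hunting for the generated pod name when pulling its logs, for example:

kubectl logs --selector kratix.io/promise-name=runtime,\
kratix.io/workflow-type=resource,\
kratix.io/workflow-action=configure,\
kratix.io/resource-name=example-runtime --all-containers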

As the workflow has finished running, we can check to ensure the documents were scheduled to the testing Destination. Like workflow pods, Works are created with a set of default labels to make querying for Works associated with given Promises and Resource Requests easier. Run the following to get the Work associated with the example-runtime resource:

kubectl get work --selector kratix.io/resource-name=example-runtime -o yaml

The status of the Work shows that it has not been scheduled:

Status:
  Conditions:
    - lastTransitionTime: "2025-03-06T07:00:45Z"
      message: 'No Destinations available work WorkloadGroups: [ae2b1fca515949e5d54fb22b8ed95575]'
      reason: UnscheduledWorkloadGroups
      status: "False"
      type: Scheduled
    - lastTransitionTime: "2025-03-06T07:00:45Z"
      message: WorkGroups that have been scheduled are at the correct Destination(s)
      reason: ScheduledToCorrectDestinations
      status: "False"
      type: Misscheduled

What does this mean? Essentially, there were no Destinations matching the label environment=testing in our environment. Let's review the available Destinations and their labels with:

kubectl get destinations --show-labels

This produces:

NAME       READY   LABELS
platform   True    environment=platform
worker-1   True    environment=dev

There is no Destination with the environment=testing label and, as a result, the documents could not be scheduled. Let's create the testing Destination. To do this we will:

  1. Create a new Cluster
  2. Create a backing State Store for the cluster
  3. Create a new Destination

As we are running on kind, we can create a new cluster by running:

kind create cluster --image kindest/node:v1.31.2 --name worker-2
export WORKER_2="kind-worker-2"

Next, we need to ensure GitOps tooling is available on the new cluster. The Kratix repo can help here again; from the root of the repo, run the following:

./scripts/install-gitops --context ${WORKER_2} --path worker-2

Our quick start has configured MinIO on the cluster, so we can use the MinIO endpoint within the BucketStateStore. Run the following to create the BucketStateStore:

cat <<EOF > testing-bucket.yaml
apiVersion: platform.kratix.io/v1alpha1
kind: BucketStateStore
metadata:
  name: testing
spec:
  authMethod: accessKey
  bucketName: kratix
  endpoint: minio.kratix-platform-system.svc.cluster.local
  insecure: true
  secretRef:
    name: minio
    namespace: default
status: {}
EOF

kubectl apply -f testing-bucket.yaml --context kind-platform

Now we can create the Destination that is backed by this State Store:

cat <<EOF > testing-destination.yaml
apiVersion: platform.kratix.io/v1alpha1
kind: Destination
metadata:
  labels:
    environment: testing
  name: testing
spec:
  cleanup: none
  filepath:
    mode: nestedByMetadata
  stateStoreRef:
    kind: BucketStateStore
    name: testing
status: {}
EOF

kubectl apply -f testing-destination.yaml --context kind-platform

After applying both of these, we should see that we have a new running BucketStateStore and Destination. However, when running the following:

kubectl get destinations.platform.kratix.io testing --context kind-platform

we observe that the testing Destination is not Ready:

NAME      READY
testing   False

Similarly, when querying the BucketStateStore with:

kubectl get BucketStateStore testing --context kind-platform

we can also see that the State Store is not Ready:

NAME      READY
testing   False

Why is this the case? Let's kubectl describe the testing Destination. Run:

kubectl describe destination testing --context kind-platform

The status of the Destination includes some conditions which detail why it is not yet ready:

Status:
  Conditions:
    Last Transition Time:  2025-03-05T11:56:05Z
    Message:               Unable to write test documents to State Store
    Reason:                StateStoreWriteFailed
    Status:                False
    Type:                  Ready

This is reiterated by an event that was fired:

Events:
  Type     Reason               Age   From                   Message
  ----     ------               ----  ----                   -------
  Warning  DestinationNotReady  20m   DestinationController  Failed to write test documents to Destination "testing": secret "minio" not found in namespace "default"

When creating both Destinations and State Stores, Kratix checks to see that the defined locations can be written to with the provided credentials before marking them as Ready. We see a similar status and event fired for the testing BucketStateStore:

Status:
  Conditions:
    Last Transition Time:  2025-03-05T11:30:07Z
    Message:               Error initialising writer: secret "minio" not found in namespace "default"
    Reason:                ErrorInitialisingWriter
    Status:                False
    Type:                  Ready
  Status:                  NotReady
Events:
  Type     Reason    Age  From                        Message
  ----     ------    ---- ----                        -------
  Warning  NotReady  52m  BucketStateStoreController  BucketStateStore "testing" is not ready: Error initialising writer: secret "minio" not found in namespace "default"
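
Both errors point at a missing secret named minio in the default namespace. A quick way to see which credential secrets actually exist there, and so what the secretRef should point at, is:

kubectl get secrets --namespace default --context kind-platform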

This means that no work can be scheduled to these Destinations until the problems are remedied, so let's fix the issue. Our MinIO credential isn't quite right: we need to edit the testing BucketStateStore to update the name of the secretRef from minio to minio-credentials.
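
One way to make the change is with a patch (kubectl edit works just as well); a sketch:

kubectl patch bucketstatestore testing --context kind-platform \
  --type merge --patch '{"spec":{"secretRef":{"name":"minio-credentials"}}}'

A few moments after the update, both the State Store and the Destination report Ready: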

Status:
  Conditions:
    Last Transition Time:  2025-03-05T12:28:31Z
    Message:               State store is ready
    Reason:                StateStoreReady
    Status:                True
    Type:                  Ready
  Status:                  Ready
Events:
  Type     Reason    Age  From                        Message
  ----     ------    ---- ----                        -------
  Warning  NotReady  59m  BucketStateStoreController  BucketStateStore "testing" is not ready: Error initialising writer: secret "minio" not found in namespace "default"
  Normal   Ready     65s  BucketStateStoreController  BucketStateStore "testing" is ready

Status:
  Conditions:
    Last Transition Time:  2025-03-05T12:28:31Z
    Message:               Test documents written to State Store
    Reason:                TestDocumentsWritten
    Status:                True
    Type:                  Ready
Events:
  Type     Reason               Age   From                   Message
  ----     ------               ----  ----                   -------
  Warning  DestinationNotReady  34m   DestinationController  Failed to write test documents to Destination "testing": secret "minio" not found in namespace "default"
  Normal   Ready                114s  DestinationController  Destination "testing" is ready

Now that the Destination is up and healthy, we can see that the Work has been scheduled successfully:

status:
  conditions:
    - lastTransitionTime: "2025-03-06T07:08:03Z"
      message: All WorkloadGroups scheduled to Destination(s)
      reason: ScheduledToDestinations
      status: "True"
      type: Scheduled

And, more importantly, our example-runtime app is up and running. We can visit it at http://example-runtime.default.local.gd:31338

[Image: the Runtime App running in the browser]
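
If you'd rather check from the terminal, a quick curl against the same address should return the Nginx-served page:

curl -s http://example-runtime.default.local.gd:31338 | head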

Now that it's deployed successfully, we're ready to build on the Runtime Promise and add the security scan as a new step when configuring resource requests.

We can bootstrap this step with the Kratix CLI's add container command. From the root of the Runtime Promise directory, you can run:

kratix add container resource/configure/instance \
--image ghcr.io/syntasso/kratix-docs/trivy-scan:v1.0.0 \
--name security-scan

This command adds a new container to the existing resource configure workflow with the name security-scan and the image ghcr.io/syntasso/kratix-docs/trivy-scan:v1.0.0.

Your directory structure should now look like this:

├── example-resource.yaml
├── promise.yaml
└── workflows
    └── resource
        └── configure
            └── instance
                ├── deploy-resources
                │   ├── Dockerfile
                │   ├── resources
                │   │   ├── postgres-request-template.yaml
                │   │   ├── redis-request-template.yaml
                │   │   └── runtime-request-template.yaml
                │   └── scripts
                │       └── pipeline.rb
                └── security-scan
                    ├── Dockerfile
                    ├── resources
                    └── scripts
                        └── pipeline.sh

You'll also see an addition to the promise.yaml, appending the security-scan container to the list of containers in the resource-configure Pipeline:

  workflows:
    resource:
      configure:
        - apiVersion: platform.kratix.io/v1alpha1
          kind: Pipeline
          metadata:
            name: resource-configure
          spec:
            containers:
              - image: ghcr.io/syntasso/kratix-docs/runtime-configure-pipeline:v0.1.0
                name: resource-configure
              - image: ghcr.io/syntasso/kratix-docs/trivy-scan:v1.0.0
                name: security-scan

Next, we need to bring this image into existence. Update the new pipeline.sh file in the security-scan directory to look like this:

#!/usr/bin/env sh

set -euxo pipefail

image="$(yq eval '.spec.image' /kratix/input/object.yaml)"

echo "Scanning ${image}"

if [ "${DEBUG:-false}" = "true" ]; then
  DEBUG_MODE=true
  echo "Running in debug mode"
else
  DEBUG_MODE=false
fi

TRIVY_DEBUG=$DEBUG_MODE trivy image --format=json --output=results.json "${image}"

health_state="healthy"

if [ "$(jq '.[] | select(.Vulnerabilities != null) | length' results.json)" != "" ]; then
  health_state="degraded"
fi

resource_name=$(yq '.metadata.name' /kratix/input/object.yaml)
namespace="default"

mkdir -p /kratix/output/platform/

cat <<EOF > /kratix/output/platform/health-record.yaml
apiVersion: platform.kratix.io/v1alpha1
kind: HealthRecord
metadata:
  name: runtime-${resource_name}
  namespace: ${namespace}
data:
  promiseRef:
    name: runtime
  resourceRef:
    name: ${resource_name}
    namespace: ${namespace}
  state: ${health_state}
  lastRun: $(date +%s)
  details:
    results: ""
EOF

cat results.json | yq -P > results.yaml
yq e -i '.data.details.results = load("results.yaml")' /kratix/output/platform/health-record.yaml

cat <<EOF > /kratix/metadata/destination-selectors.yaml
- directory: platform
  matchLabels:
    environment: platform
EOF

This script retrieves the image specified in the request, scans it with Trivy, and outputs a HealthRecord detailing the results.

To use Trivy, we also need to update the generated Dockerfile to install the Trivy CLI. Update your Dockerfile to look like this:

FROM alpine

RUN apk update && apk add --no-cache yq curl jq

RUN curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh -s -- -b /usr/local/bin v0.18.3

ADD scripts/pipeline.sh /usr/bin/pipeline.sh
ADD resources resources

RUN chmod +x /usr/bin/pipeline.sh

CMD [ "sh", "-c", "pipeline.sh" ]
ENTRYPOINT []

To ensure the new security-scan image is available on the kind clusters, we need to build it and load it onto the kind node. Run the following from the runtime-promise/workflows/resource/configure/instance/security-scan directory:

docker build . --tag ghcr.io/syntasso/kratix-docs/trivy-scan:v1.0.0
kind load docker-image ghcr.io/syntasso/kratix-docs/trivy-scan:v1.0.0 --name platform
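
Before exercising the step through Kratix, you can give the image a quick local run against an example request. A rough sketch, assuming you run it from the root of the Promise directory and that example-resource.yaml is a valid request with spec.image set:

mkdir -p test/input test/output test/metadata
cp example-resource.yaml test/input/object.yaml
docker run --rm \
  -v "$PWD/test/input:/kratix/input" \
  -v "$PWD/test/output:/kratix/output" \
  -v "$PWD/test/metadata:/kratix/metadata" \
  -e DEBUG=true \
  ghcr.io/syntasso/kratix-docs/trivy-scan:v1.0.0

If the scan runs cleanly, the generated HealthRecord lands in test/output/platform/health-record.yaml.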

Whilst testing this new step, we'll run it in debug mode should any issues come up. Our script has already been configured to allow this to be set via an environment variable.

Add the following lines to the Promise spec for the newly introduced container:

- image: ghcr.io/syntasso/kratix-docs/trivy-scan:v1.0.0
  name: security-scan
  envs:
    - name: DEBUG
      value: "true"

We're nearly ready to install the Promise!

Identifying invalid workflows

Before you can install your Promise, Kratix ensures that your Promise has valid Workflow definitions that can be used to generate your workflow pods. Apply your updated Promise with:

kubectl apply -f promise.yaml

You should see a message that includes:

json: unknown field "envs"

We have a slight typo in the Promise spec for the new workflow: envs should be env. Correct this and apply the Promise again.
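
The corrected container definition should look like this:

- image: ghcr.io/syntasso/kratix-docs/trivy-scan:v1.0.0
  name: security-scan
  env:
    - name: DEBUG
      value: "true"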

Re-applying the Promise will trigger the Workflows for the example-runtime request which should now generate a HealthRecord with the security scan results. Run:

kubectl get healthrecord --context kind-platform

and you should see output similar to:

NAME                      STATUS     AGE
runtime-example-runtime   degraded   8m12s

Also, as the HealthRecord references the example-runtime request, the request's status should be updated to reflect the results in the HealthRecord. Run:

kubectl describe runtime example-runtime --context kind-platform

and you should see something similar to the following:

  Health Record:
    Details:
      Results:
        ...
    Last Run:  1741248436
    State:     degraded

Great! The updated Runtime Promise now provisions requests and surfaces the security scan results for each one as a HealthRecord.

Overview

We've explored some of the common stumbling blocks that can come up when working with Kratix - issues with scheduling and configuring Destinations - and the features of Kratix you can use as debugging tools.

Many of these are new features we've introduced following feedback from our customers, so if there is a gotcha that has caught you out in the past, let us know via GitHub or our Community Slack - we're always happy to hear from users.