
Kubernetes instancer

Deploy the rCTF Kubernetes instancer on GKE with the bundled Terraform modules and operator.

The Kubernetes instancer is the scalable backend for per-team challenge instances. An operator runs inside the cluster, turning a ChallengeInstance custom resource into a namespace, network policies, deployments, services, and Traefik routes.

The rCTF API only talks to the Kubernetes API server through the instancer/k8s-instancer provider. The operator handles every other moving piece.

Warning (Hostile workloads)

Challenge images run untrusted code. The defaults assume strict isolation, but adding variables, widening RBAC, or changing any other security-sensitive setting can quickly break that assumption.

Architecture

A working deployment has three cooperating components:

| Component | Source | Responsibility |
| --- | --- | --- |
| K8sInstancerProvider | apps/api/src/providers/instancer/k8s-instancer.ts | Translates rCTF lifecycle calls into create, get, patch, and delete operations on ChallengeInstance custom resources. |
| k8s-controller | apps/k8s-controller/ | Go operator built with controller-runtime that watches ChallengeInstance and reconciles the cluster state. |
| Traefik | deploy/terraform/instancer/modules/k8s/traefik.tf | Helm-installed ingress controller that terminates TLS and routes wildcard hostnames to per-instance services. |

A participant request flows through these in order:

rCTF creates the custom resource

The API receives PUT /api/v2/integrations/challs/:id/instance, validates the challenge config with the provider schema, then creates a cluster-scoped ChallengeInstance resource with API version rctf-instancer.osec.io/v1. The CR carries the challenge ID, team ID, expiry, pod specs, and expose entries.

The controller reconciles

The operator watches ChallengeInstance events and runs its reconciliation loop. It adds the rctf.osec.io/finalizer finalizer, then creates a namespace, network policies, deployments, services, and Traefik IngressRoute or IngressRouteTCP resources for each expose entry.

Traefik routes participant traffic

Each expose entry gets a hostname of the form <hostPrefix>-<uid>.<instancer-host>. Traefik matches the hostname against the generated IngressRoute and forwards traffic to the per-instance service.

The controller cleans up at expiry

When time.Now() passes spec.expiresAt, the controller deletes the ChallengeInstance. Setting the deletion timestamp hands control to the finalizer logic: the controller deletes the namespace, then removes the finalizer only once the namespace is gone. Manual deletion through the rCTF API follows the same path.
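The lifecycle above can be read as a small decision function. The Python sketch below illustrates only the flow described here and is not the operator's Go code. The Instance class, the next_action name, and the action strings are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional

FINALIZER = "rctf.osec.io/finalizer"


@dataclass
class Instance:
    """Hypothetical, minimal view of a ChallengeInstance."""
    expires_at: datetime
    finalizers: List[str] = field(default_factory=list)
    deletion_timestamp: Optional[datetime] = None


def next_action(inst: Instance, namespace_exists: bool, now: datetime) -> str:
    """Return the step a reconcile pass would take for this instance."""
    if inst.deletion_timestamp is not None:
        # Deletion was requested: drain the namespace before releasing the CR.
        if FINALIZER not in inst.finalizers:
            return "nothing"
        return "delete-namespace" if namespace_exists else "remove-finalizer"
    if FINALIZER not in inst.finalizers:
        return "add-finalizer"
    if now > inst.expires_at:
        return "delete-instance"
    # Namespace, policies, deployments, services, and routes.
    return "ensure-resources"


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
inst = Instance(expires_at=now + timedelta(minutes=30), finalizers=[FINALIZER])
print(next_action(inst, namespace_exists=True, now=now + timedelta(hours=1)))
```

Deleting the namespace before removing the finalizer is what keeps a ChallengeInstance from disappearing while its workloads are still running.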

Namespaces are deterministic and named inst-<challenge-id>-<team-id> so the controller can find them across restarts. Every child resource inherits owner references from the ChallengeInstance, so cluster-level garbage collection acts as a safety net behind the explicit finalizer.
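As a rough illustration of the two naming patterns, the following Python sketch builds a namespace name and a per-instance hostname. The function names and the sample uid are hypothetical, and the real controller may normalize or shorten values (Kubernetes namespace names are limited to 63 characters), which this sketch does not attempt.

```python
def instance_namespace(challenge_id: str, team_id: str) -> str:
    """Deterministic namespace name: inst-<challenge-id>-<team-id>."""
    return f"inst-{challenge_id}-{team_id}"


def instance_hostname(host_prefix: str, uid: str, instancer_host: str) -> str:
    """Participant-facing hostname: <hostPrefix>-<uid>.<instancer-host>."""
    return f"{host_prefix}-{uid}.{instancer_host}"


# Sample values are made up for illustration.
print(instance_namespace("mirror-temple", "team42"))
print(instance_hostname("mirror-temple", "a1b2c3", "instances.ctf.example.com"))
```

Because the namespace name depends only on the challenge and team IDs, it can be recomputed after a controller restart without any stored state.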

Prerequisites

The Terraform example assumes GKE plus Cloudflare for DNS and ACME. GCP Cloud DNS works as a drop-in alternative.

| Requirement | Notes |
| --- | --- |
| GCP project with billing enabled | Used for GKE, Artifact Registry, and optionally Cloud DNS. |
| gcloud and kubectl | Needed for cluster auth. |
| terraform 1.5+ | The example pins providers but not the Terraform CLI. |
| kind (optional) | Only required for local controller development. |
| Domain plus DNS provider | One of Cloudflare or GCP Cloud DNS. Used for the ACME DNS-01 challenge and the wildcard A record. |
| Let’s Encrypt account email | Registered through the acme_registration resource. |

The instancer’s public hostname is <instancer_subdomain>.<instancer_zone> (or just <instancer_zone> when no subdomain is set). All per-instance hostnames live under a wildcard one level below.
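A minimal Python sketch of that rule, assuming an empty string means "no subdomain". The function names are hypothetical and this is not the Terraform expression itself.

```python
def instancer_host(subdomain: str, zone: str) -> str:
    """<instancer_subdomain>.<instancer_zone>, or the zone alone when no subdomain is set."""
    return f"{subdomain}.{zone}" if subdomain else zone


def wildcard_name(subdomain: str, zone: str) -> str:
    """Wildcard one level below the instancer host; per-instance hostnames match it."""
    return f"*.{instancer_host(subdomain, zone)}"


print(instancer_host("instances", "ctf.example.com"))
print(wildcard_name("", "ctf.example.com"))
```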

Controller image

The operator image is published at ghcr.io/otter-sec/rctf-new/k8s-controller, and the matching install.yaml ships in the repo at apps/k8s-controller/dist/install.yaml. The Terraform k8s module reads that file directly and substitutes the configured hostname into the INSTANCER_HOST placeholder, so there’s nothing to build or push before running terraform apply.

Terraform variables

The example terraform.tfvars lives in deploy/terraform/instancer/example/:

  • deploy/
    • terraform/
      • instancer/
        • example/
          • main.tf Providers, GKE module wiring
          • dns.tf Cloudflare or GCP Cloud DNS record
          • tls.tf ACME wildcard certificate and Traefik TLSStore
          • rctf-instancer.tf k8s module call and rCTF ServiceAccount
          • variables.tf Input variables
          • terraform.tfvars.example Example values
        • modules/
          • gke/ GKE cluster and Artifact Registry
          • k8s/ Traefik, error pages, controller installer

Copy terraform.tfvars.example to terraform.tfvars and fill in:

| Variable | Required | Purpose |
| --- | --- | --- |
| cloudflare_api_token | Cloudflare only | API token with Zone.DNS edit on the configured zone. Used for ACME DNS-01 and the wildcard record. |
| letsencrypt_email_address | Yes | Address registered with Let’s Encrypt for the wildcard certificate. |
| instancer_zone | Yes | Base apex domain, for example ctf.example.com. |
| instancer_subdomain | Yes | Optional subdomain in front of instancer_zone. Set to an empty string when serving instances from the apex. |
| ctf_name | Yes | Displayed on the 404 and 502 error pages rendered by the in-cluster Nginx deployment. |
| gcp_dns_managed_zone_name | Cloud DNS only | Managed-zone name when using google_dns_record_set instead of Cloudflare. |
| gcp_project_id | Yes | GCP project that hosts the GKE cluster and Artifact Registry. |
| gcp_region | Yes | GKE control-plane region, for example us-central1. |
| gcp_zone | Yes | Single zone for the node pool. Set to the same value as gcp_region for a multi-zone cluster. The Artifact Registry location is derived from this with a ^(.*)-[a-z]$ regex. |
| gcp_instancer_cluster_name | Yes | GKE cluster name and Artifact Registry service account ID. |
| gcp_instancer_machine_type | Yes | GCE machine type, for example e2-medium. |
| gcp_instancer_min_node_count | No | Autoscaling minimum. Defaults to 1. |
| gcp_instancer_max_node_count | No | Autoscaling maximum. Defaults to 1. |
| gcp_instancer_pod_pids_limit | No | Per-pod kubelet PID cap. Defaults to 1024. Bounds fork bombs in challenge pods so one container can’t drain the node’s kernel.pid_max. Must be >= 1024 per GKE’s kubelet validation. |
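Two of the rules in this table are easy to check before running Terraform. The Python sketch below applies the ^(.*)-[a-z]$ pattern to gcp_zone and enforces the PID floor. It assumes the first capture group is the Artifact Registry location and that a value without a zone suffix (a region) is used unchanged; both function names are hypothetical.

```python
import re

ZONE_SUFFIX = re.compile(r"^(.*)-[a-z]$")


def artifact_registry_location(gcp_zone: str) -> str:
    """Strip a trailing zone letter such as "-a"; assumed fallback is the input itself."""
    match = ZONE_SUFFIX.match(gcp_zone)
    return match.group(1) if match else gcp_zone


def check_pids_limit(limit: int = 1024) -> int:
    """Reject values below the 1024 floor stated for gcp_instancer_pod_pids_limit."""
    if limit < 1024:
        raise ValueError(f"gcp_instancer_pod_pids_limit must be >= 1024, got {limit}")
    return limit


print(artifact_registry_location("us-central1-a"))
print(artifact_registry_location("us-central1"))
print(check_pids_limit(2048))
```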

A minimal Cloudflare-backed file looks like this:

terraform.tfvars
cloudflare_api_token = "<cloudflare-api-token>"
letsencrypt_email_address = "ops@example.com"
instancer_zone = "ctf.example.com"
instancer_subdomain = "instances"
ctf_name = "Example CTF"
gcp_project_id = "example-ctf"
gcp_region = "us-central1"
gcp_zone = "us-central1-a"
gcp_instancer_cluster_name = "rctf-instancer"
gcp_instancer_machine_type = "e2-standard-4"
gcp_instancer_min_node_count = 1
gcp_instancer_max_node_count = 8

To use GCP Cloud DNS instead of Cloudflare, comment out the Cloudflare blocks in dns.tf and tls.tf, uncomment the google_dns_record_set and gcloud ACME blocks, and set gcp_dns_managed_zone_name.

Deployment

Initialize Terraform
Terminal window
cd deploy/terraform/instancer/example
cp terraform.tfvars.example terraform.tfvars
$EDITOR terraform.tfvars
terraform init

Apply the stack
Terminal window
terraform apply

Terraform provisions GKE, the node pool, Artifact Registry, the Cloudflare or Cloud DNS record, the ACME wildcard certificate, Traefik, the error-pages deployment, and the rctf service account. It then applies the bundled apps/k8s-controller/dist/install.yaml, which points at the prebuilt ghcr.io/otter-sec/rctf-new/k8s-controller image. The first apply typically takes 10 to 15 minutes, and ACME validation alone can add a few more if DNS propagation is slow.

Fetch kubectl credentials
Terminal window
gcloud container clusters get-credentials rctf-instancer --project example-ctf --location us-central1
kubectl get pods -n rctf-instancer-controller-system

The controller pod should be Running. Traefik comes up in the traefik namespace, with the wildcard certificate stored in the instancer-wildcard-tls Kubernetes Secret.

Wire the outputs into rCTF

Three Terraform outputs map directly to provider options:

| Terraform output | rCTF option | Environment override |
| --- | --- | --- |
| rctf_instancer_api_url | instancerProvider.options.apiUrl | K8S_INSTANCER_API_URL |
| rctf_instancer_auth_token | instancerProvider.options.authToken | K8S_INSTANCER_AUTH_TOKEN |
| rctf_instancer_ca_certificate | instancerProvider.options.caCertificate | K8S_INSTANCER_CA_CERTIFICATE |

Render them into rCTF’s rctf.d/:

Terminal window
terraform output -raw rctf_instancer_api_url
terraform output -raw rctf_instancer_auth_token
terraform output -raw rctf_instancer_ca_certificate

rctf.d/instancer.yaml
instancerProvider:
  name: instancer/k8s-instancer
  options:
    apiUrl: https://203.0.113.10
    authToken: <rctf_instancer_auth_token>
    caCertificate: |
      -----BEGIN CERTIFICATE-----
      ...
      -----END CERTIFICATE-----

caCertificate is required even when the API server certificate is already trusted by the host.
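The copy-and-paste step can also be scripted. The following Python sketch reads the three outputs with terraform output -raw and renders the YAML shown above. It is one possible approach rather than part of the repo; read_output and render_instancer_yaml are hypothetical helpers, and string values are emitted as double-quoted YAML scalars to stay safe with unusual characters.

```python
import json
import subprocess
import textwrap


def read_output(name: str, cwd: str = ".") -> str:
    """Read one Terraform output via `terraform output -raw <name>`."""
    result = subprocess.run(
        ["terraform", "output", "-raw", name],
        cwd=cwd, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def render_instancer_yaml(api_url: str, auth_token: str, ca_certificate: str) -> str:
    """Render rctf.d/instancer.yaml; the PEM goes into an indented block scalar."""
    ca_block = textwrap.indent(ca_certificate.strip(), " " * 6)
    return (
        "instancerProvider:\n"
        "  name: instancer/k8s-instancer\n"
        "  options:\n"
        f"    apiUrl: {json.dumps(api_url)}\n"
        f"    authToken: {json.dumps(auth_token)}\n"
        "    caCertificate: |\n"
        f"{ca_block}\n"
    )


# Typical use, run from deploy/terraform/instancer/example after `terraform apply`:
#   text = render_instancer_yaml(
#       read_output("rctf_instancer_api_url"),
#       read_output("rctf_instancer_auth_token"),
#       read_output("rctf_instancer_ca_certificate"),
#   )
```

The rendered file contains the service-account token, so treat it like any other secret when writing it to disk.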

Verify end-to-end

Create an instanced challenge that uses the instancer/k8s-instancer provider and start it as a participant. The controller should create the inst-<challenge-id>-<team-id> namespace, and Traefik should serve the <hostPrefix>-<uid>.<instancer-host> hostname over HTTPS.

What Terraform provisions

The example layers the GKE module, the k8s module, and the example-level resources in rctf-instancer.tf, dns.tf, and tls.tf:

| Resource | Source | Purpose |
| --- | --- | --- |
| GKE cluster (google_container_cluster) | modules/gke/gke.tf | Stable release channel, workload identity, COS_CONTAINERD nodes, weekly Wednesday 07:00 to 19:00 UTC maintenance window. |
| GKE primary node pool | modules/gke/gke.tf | Autoscaling between min_node_count and max_node_count, surge upgrades, optional preemptible nodes, workload metadata concealment. |
| GCP service account | modules/gke/gke.tf | Identity for cluster nodes. Granted roles/artifactregistry.writer on the challenge registry. |
| Artifact Registry repo (google_artifact_registry_repository) | modules/gke/registry.tf | Docker registry named challenge-registry. Keeps the five most recent versions, deletes images older than 30d. |
| Traefik (helm_release.traefik) | modules/k8s/traefik.tf | LoadBalancer service with externalTrafficPolicy: Local to preserve client IPs, plus the dashboard entrypoint for kubectl port-forward. |
| Nginx error pages | modules/k8s/traefik.tf | kubernetes_deployment_v1.error-pages plus a ConfigMap rendering 404 and 502 templates with ctf_name. |
| Traefik Middleware and catch-all IngressRoute | modules/k8s/traefik.tf | Middleware intercepts 502 errors and serves the Nginx page. The catch-all HostRegexp(.*) route returns the 404 page for unmatched hosts. |
| Controller installer (kubectl_manifest) | modules/k8s/rctf-instancer.tf | Applies every manifest in apps/k8s-controller/dist/install.yaml, replacing INSTANCER_HOST with the resolved hostname. |
| ACME wildcard certificate (acme_certificate) | example/tls.tf | DNS-01 challenge through Cloudflare or Cloud DNS. The chain and key land in the instancer-wildcard-tls Secret in the traefik namespace. |
| Traefik TLSStore (kubectl_manifest) | example/tls.tf | Sets instancer-wildcard-tls as the default certificate for the cluster. |
| Wildcard DNS record (cloudflare_dns_record) | example/dns.tf | *.<subdomain> A record pointing at the Traefik LoadBalancer IP. The GCP variant uses google_dns_record_set. |
| rctf service account, ClusterRole, ClusterRoleBinding, and Secret | example/rctf-instancer.tf | Service account in kube-system with verbs create, get, delete, and patch on challengeinstances.rctf-instancer.osec.io. A long-lived kubernetes.io/service-account-token secret backs the rctf_instancer_auth_token output. |

Traefik is configured with three ports:

| Port | Entry point | Purpose |
| --- | --- | --- |
| 80 | web | HTTP routes plus global 502 middleware. |
| 443 | websecure | HTTPS routes terminated with the ACME wildcard. |
| 1337 | tcp | Raw TCP with SNI routing for tcp-ssl expose kinds. |

The wildcard certificate is issued by Terraform from outside the cluster instead of through cert-manager, so DNS provider credentials never have to live inside the cluster. The blast radius of a cluster compromise stays limited to whatever certificates Terraform has already issued.

Network policies

The controller creates three NetworkPolicy resources in every instance namespace:

| Policy | Pod selector | Behavior |
| --- | --- | --- |
| isolate-namespace | All pods | Ingress is restricted to other pods in the same managed namespace. Egress is restricted to pods in the same namespace plus UDP 53 to the kube-system namespace for DNS. |
| ingress-traefik | rctf.osec.io/exposed=true | Allows ingress from Traefik pods in the traefik namespace. Applied only to pods that match an expose[].containerName entry. |
| egress | rctf.osec.io/egress=true | Allows egress to 0.0.0.0/0 except RFC1918 (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), CGNAT (100.64.0.0/10), and link-local (169.254.0.0/16). |

The exposed label is applied automatically based on whether a pod is named by any expose[] entry. The egress label comes from the per-pod egress: true flag in instancerConfig. Challenges that shouldn’t reach the internet leave it false.
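To make the egress rule concrete, the Python sketch below models which external IPv4 destinations the egress policy would admit. It only restates the CIDR list from the table; enforcement happens in the cluster's CNI, and the function name is hypothetical. Same-namespace traffic and DNS, which isolate-namespace permits separately, are out of scope here.

```python
import ipaddress

# Ranges carved out of 0.0.0.0/0 by the egress policy.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8",      # RFC1918
        "172.16.0.0/12",   # RFC1918
        "192.168.0.0/16",  # RFC1918
        "100.64.0.0/10",   # CGNAT
        "169.254.0.0/16",  # link-local, includes the cloud metadata address
    )
]


def egress_allowed(destination: str, has_egress_label: bool) -> bool:
    """True if the egress policy would admit traffic to this IPv4 destination."""
    if not has_egress_label:
        return False  # pod never opted in with egress: true
    address = ipaddress.ip_address(destination)
    if address.version != 4:
        return False  # the policy only lists an IPv4 range
    return not any(address in network for network in BLOCKED_NETWORKS)


print(egress_allowed("8.8.8.8", True))
print(egress_allowed("169.254.169.254", True))
```

The link-local exclusion is what keeps an egress-enabled challenge pod away from the node metadata endpoint at 169.254.169.254.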

Note (Cluster network plugin)

Network policies only enforce isolation when the cluster’s CNI supports them. GKE’s default dataplane enforces them. On a bare-metal cluster, make sure the chosen CNI honors NetworkPolicy.

Per-pod safety checklist

Unlike Docker, Kubernetes’ PodSpec doesn’t have first-class fields for every resource cap, and the controller deploys what you give it verbatim. Set these on every pod you ship through instancerConfig.config.pods[]:

| Setting | Why it matters |
| --- | --- |
| resources.limits.cpu / resources.limits.memory | Without limits a pod can use whatever the node has free. Set them per-container. |
| resources.limits.ephemeral-storage | A participant dd-ing /tmp/ fills the node’s overlayfs and takes down every other pod sharing it. Pick a value (128Mi is reasonable for most challenges). |
| securityContext.readOnlyRootFilesystem set to true | Pairs with the ephemeral-storage limit. If the challenge needs to write somewhere, mount a sized emptyDir with sizeLimit. |
| securityContext.allowPrivilegeEscalation set to false and dropped capabilities | Defaults are unsafe. Drop ALL and only add what the challenge actually needs. |
| automountServiceAccountToken set to false on the pod | Otherwise the default service-account token gets mounted into the container. |
| terminationGracePeriodSeconds | Cap it (e.g. 10) so held TCP connections don’t delay pod cleanup for minutes when an instance expires. |
| volumes[].emptyDir.sizeLimit | Any emptyDir mount needs a size cap or the same disk-fill issue applies. |

Warning (File descriptor / nofile limits)

Kubernetes’ PodSpec has no first-class ulimits field. The CRI passes container ulimits through containerd’s default_ulimits, which on GKE COS_CONTAINERD nodes isn’t exposed via the kubelet config. Changing it takes a custom node startup script or DaemonSet that rewrites /etc/containerd/config.toml.

If your challenge is sensitive to FD exhaustion, the practical workaround is to set the limit in the entrypoint:

Dockerfile entrypoint
#!/bin/sh
ulimit -n 1024
exec /your/challenge "$@"

This is per-image, not platform-enforced, so it’s only as strong as the image. Don’t rely on it for hostile-input boundaries that absolutely must not break. Reach for a per-connection sandbox (nsjail) instead.
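Since the controller deploys pod specs verbatim, the per-pod checklist in this section is worth checking mechanically before a challenge ships. The Python sketch below lints a pod spec that has already been parsed into a dict. It is a hypothetical helper, not something rCTF or the controller provides, and it only checks for presence of each setting, not whether the values are sensible.

```python
def lint_pod_spec(spec: dict) -> list:
    """Return a list of checklist violations for one Kubernetes PodSpec dict."""
    findings = []
    if spec.get("automountServiceAccountToken") is not False:
        findings.append("automountServiceAccountToken should be false")
    if "terminationGracePeriodSeconds" not in spec:
        findings.append("terminationGracePeriodSeconds is not capped")
    for volume in spec.get("volumes", []):
        if "emptyDir" in volume and "sizeLimit" not in (volume["emptyDir"] or {}):
            findings.append(f"volume {volume.get('name', '?')}: emptyDir has no sizeLimit")
    for container in spec.get("containers", []):
        name = container.get("name", "?")
        limits = container.get("resources", {}).get("limits", {})
        for key in ("cpu", "memory", "ephemeral-storage"):
            if key not in limits:
                findings.append(f"{name}: missing resources.limits.{key}")
        context = container.get("securityContext", {})
        if context.get("readOnlyRootFilesystem") is not True:
            findings.append(f"{name}: readOnlyRootFilesystem should be true")
        if context.get("allowPrivilegeEscalation") is not False:
            findings.append(f"{name}: allowPrivilegeEscalation should be false")
        if "ALL" not in context.get("capabilities", {}).get("drop", []):
            findings.append(f"{name}: capabilities.drop should include ALL")
    return findings


# An empty container picks up every per-container finding.
for finding in lint_pod_spec({"containers": [{"name": "app"}]}):
    print(finding)
```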

RBAC and the rCTF service account

The example creates a single ServiceAccount named rctf in kube-system and a matching ClusterRole granting only what the API needs:

kubernetes_cluster_role_v1.rctf
rule:
  api_groups: ['rctf-instancer.osec.io']
  resources: ['challengeinstances']
  verbs: ['create', 'get', 'delete', 'patch']

The rCTF API never reads or writes any other resource type. The kubernetes_secret_v1.rctf_token resource issues a kubernetes.io/service-account-token so the token doesn’t rotate. The value comes back through the rctf_instancer_auth_token Terraform output.

The controller itself runs with its own RBAC from apps/k8s-controller/config/. It needs broad permissions on namespaces, deployments, services, network policies, and Traefik IngressRoute, IngressRouteTCP, and Middleware resources so it can reconcile per-instance objects. The CRD lives in apps/k8s-controller/config/crd/bases/ and is generated by make manifests.
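The narrowness of the rCTF rule is easiest to see by asking what it does not cover. The Python sketch below restates the rule as data and checks requests against it. Real authorization is decided by the Kubernetes API server, so this is only a model of the single rule shown above, and rctf_allowed is a hypothetical name.

```python
RCTF_RULE = {
    "api_groups": {"rctf-instancer.osec.io"},
    "resources": {"challengeinstances"},
    "verbs": {"create", "get", "delete", "patch"},
}


def rctf_allowed(verb: str, api_group: str, resource: str) -> bool:
    """True if the single ClusterRole rule covers this request."""
    return (
        api_group in RCTF_RULE["api_groups"]
        and resource in RCTF_RULE["resources"]
        and verb in RCTF_RULE["verbs"]
    )


print(rctf_allowed("get", "rctf-instancer.osec.io", "challengeinstances"))
print(rctf_allowed("list", "rctf-instancer.osec.io", "challengeinstances"))
print(rctf_allowed("get", "", "secrets"))
```

Note that list and watch are absent, so a leaked token cannot enumerate instances. On a live cluster, kubectl auth can-i with --as=system:serviceaccount:kube-system:rctf answers the same questions authoritatively.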

Example challenge config (Konata)

The following is a complete instanced challenge as it would live in a Konata deployment repo: the web/mirror-temple config from the DiceCTF Quals 2026 challenge repository. Konata builds and pushes the image, then forwards instancer_config straight to rCTF, which hands it to the k8s-instancer provider.

web/mirror-temple/kona.yml
challenges:
  - category: web
    name: mirror-temple
    author: arcblroth
    description: |
      stare long enough at the void and the void stares back
    attachments:
      files:
        - 'Dockerfile'
        - 'chall/src/'
    flags:
      rctf:
        file: flag.txt
    instancer_config:
      challenge_integration_id: '{{ challenge.name }}'
      timeout_milliseconds: 1800000
      extendable: true
      expose:
        - kind: https
          host_prefix: '{{ challenge.name }}'
          container_name: app
          container_port: 8080
      config:
        pods:
          - name: app
            egress: true
            ports:
              - protocol: TCP
                name: http-service
                port: 8080
            spec:
              restartPolicy: Always
              terminationGracePeriodSeconds: 0
              automountServiceAccountToken: false
              containers:
                - name: app
                  image: '{{ images[challenge.name] }}'
                  ports:
                    - containerPort: 8080
                  resources:
                    requests:
                      cpu: '500m'
                      memory: '500Mi'
                    limits:
                      cpu: '3'
                      memory: '2Gi'
                  readinessProbe:
                    tcpSocket:
                      port: 8080
                    initialDelaySeconds: 5
                    periodSeconds: 3
                  securityContext:
                    runAsNonRoot: true
                    readOnlyRootFilesystem: true
                    allowPrivilegeEscalation: false
                    capabilities:
                      drop:
                        - ALL
deployment:
  images:
    - build_context: .
      name: '{{ challenges[0].name }}'
      tag: latest
      registry_name: instancer-challenges
      platform: linux/amd64

Things worth pointing at in this example:

  • egress: true on the pod opts it into the egress NetworkPolicy so the challenge can reach the public internet. Drop it for challenges that should be sealed off.
  • Resource requests and limits are mandatory in practice. The controller schedules the pod normally, so an unset limit lets a single instance starve the node. Size them to the per-team load you expect at peak.
  • readinessProbe keeps Traefik from routing to the pod before the app is up. Without it, the first request after creation often 502s while the container is still booting.
  • securityContext locks the container down (read-only root FS, dropped capabilities, no root). The k8s-instancer namespace is already isolated by the per-namespace NetworkPolicy set, but a tight pod-level context is the second layer.
  • {{ images[challenge.name] }} resolves to the fully-qualified registry path Konata pushed to (registries.instancer-challenges + the image name + tag).
  • flags.rctf.file: flag.txt lets the flag live in a sibling file Konata reads at sync time, so the challenge directory stays self-contained.

For the rest of the Konata schema, see the Konata documentation.

Local development with Kind

For controller iteration, the README in apps/k8s-controller/ uses Kind. Routing inside the cluster needs cloud-provider-kind so that LoadBalancer services get an external IP.

Install Kind and the cloud provider shim
Terminal window
go install sigs.k8s.io/cloud-provider-kind@latest

Install Kind itself from its quick-start guide.

Create the cluster
Terminal window
cd apps/k8s-controller
kind create cluster --name rctf --config kind-config.yaml

The bundled kind-config.yaml spins up one control plane and one worker node.

Run cloud-provider-kind

Leave this running in a separate session for the duration of development:

Terminal window
cloud-provider-kind

Apply the surrounding stack

Point the Terraform example at the local Kind context. In deploy/terraform/instancer/example/main.tf switch from the GCP-backed kubernetes, helm, and kubectl providers to the commented-out kind-rctf blocks, then apply:

Terminal window
cd deploy/terraform/instancer/example
terraform apply

Install the CRD and run the controller
Terminal window
cd apps/k8s-controller
make install
make run ARGS="-instancer-host instancer.test"

make install applies the CRDs from config/crd, and make run runs the controller against the current kubectl context with -instancer-host setting the hostname suffix.

Test with a sample ChallengeInstance
Terminal window
kubectl apply -f config/sample/rctf-instancer_v1_challengeinstance.yaml

The controller logs the reconciliation flow, and the sample’s namespace, service, and IngressRoute should appear. Tear the cluster down with kind delete cluster --name rctf when you’re done.

Troubleshooting

| Symptom | Likely cause |
| --- | --- |
| terraform apply hangs on acme_certificate | DNS propagation for the DNS-01 record is slow. Verify the Cloudflare or Cloud DNS TXT record is visible from a public resolver. |
| Controller pod CrashLoopBackOff after install | The image reference in dist/install.yaml can’t be pulled. The default is ghcr.io/otter-sec/rctf-new/k8s-controller. Verify the node has registry access and the tag still exists. |
| Instances stuck in starting | Inspect the ChallengeInstance status conditions with kubectl get challengeinstance -A -o yaml. The NamespaceDeployed, DeploymentsDeployed, and ServicesDeployed conditions narrow down the failing stage. |
| 502 from the wildcard host | Traefik is reachable but the backing pod isn’t ready. The global-errors middleware serves the Nginx 502 page until the deployment reports ready replicas. |
| 404 on the wildcard host | The catch-all IngressRoute matched. Confirm an active ChallengeInstance exists for the hostname and that its IngressRoute has a higher priority than 1. |
| rCTF returns 400 badInstancerConfig | The challenge config failed the provider’s Zod schema. Fetch the schema from /api/v2/admin/instancer/schema and validate the challenge manifest against it. |
| Namespace stuck Terminating | A child resource still holds a finalizer. The controller waits one second per reconcile while the namespace drains. Check Traefik CRDs in the namespace if the wait doesn’t resolve. |

The controller exposes Kubernetes events through standard kubectl describe output. Pair kubectl describe challengeinstance <name> with the controller logs (kubectl logs -n rctf-instancer-controller-system -l control-plane=controller-manager) to track down any reconciliation failure.
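When an instance is stuck in starting, the status conditions can be scanned programmatically. The Python sketch below assumes the conditions follow the usual Kubernetes shape (a list of objects with type and status, where status is the string "True" when met) and that the three stages complete in the order listed in the table. Both are assumptions, and first_failing_stage is a hypothetical helper.

```python
from typing import Optional

STAGES = ("NamespaceDeployed", "DeploymentsDeployed", "ServicesDeployed")


def first_failing_stage(conditions: list) -> Optional[str]:
    """Return the first stage that is missing or not "True", or None if all passed."""
    by_type = {condition["type"]: condition for condition in conditions}
    for stage in STAGES:
        condition = by_type.get(stage)
        if condition is None or condition.get("status") != "True":
            return stage
    return None


# Shape assumed for .status.conditions in `kubectl get challengeinstance -o json`.
sample = [
    {"type": "NamespaceDeployed", "status": "True"},
    {"type": "DeploymentsDeployed", "status": "False"},
]
print(first_failing_stage(sample))
```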
