Flue, Part 3: Self-Hosting Agents with Knative

Part 1 was the framework pitch. Part 2 was the agent I wanted to build. This one is for the people who can’t (or won’t) ship that agent to Cloudflare Workers.

The reasons vary. Compliance says all customer data stays in our VPC. Security says no third-party serverless on the data path. Procurement won’t sign another vendor contract this fiscal year. We already run Kubernetes for everything else, so why are we paying someone to schedule containers for us? All valid.

Self-hosting a Flue agent on Kubernetes is straightforward. The trick is doing it without rebuilding the things managed serverless gives you for free: HTTP routing, autoscaling, scale-to-zero, ingress, TLS. That’s where Knative earns its keep.

Why Knative

Knative is the K8s-native answer to serverless. You hand it a container image and a port. It hands you back a public URL, request-based autoscaling, scale-to-zero when nothing is calling, and TLS through cert-manager. The whole “deploy a webhook handler” thing collapses into a single Kubernetes resource.

For a Flue agent specifically:

The agent boots in a few hundred milliseconds. Cold start is fine.
Webhook traffic is bursty. Scale-to-zero saves real money.
Inbound traffic is HTTP only. No need for a service mesh.

OpenFAAS works too. So does spinning up a Deployment plus Service plus Ingress plus HPA by hand. I keep coming back to Knative because the Service resource is one file and the operational story (events, observability, autoscaling) is built in.

The Dockerfile

Flue ships as a Node CLI plus a function bundle. For deployment, build a small image with the function code and the SDK’s HTTP server in front:

FROM node:22-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --omit=dev

FROM node:22-alpine
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY agent.ts jira.ts ./
COPY skills ./skills
ENV NODE_ENV=production
EXPOSE 8080
CMD ["npx", "flue", "serve", "--host", "0.0.0.0", "--port", "8080", "agent.ts"]

Two-stage build to keep the runtime image small. The skills directory comes along since the agent reads context files at runtime. flue serve is the SDK’s HTTP entrypoint that turns export default into a webhook handler at /.

If you’re running a Bun-flavored Flue agent, swap the base image for oven/bun:1-alpine and the CMD for bun run agent.ts. Same shape, fewer layers.

Push to whatever registry you’re using:

docker build -t registry.internal/agents/jira-triage:0.1.0 .
docker push registry.internal/agents/jira-triage:0.1.0

The Knative Service

The Service resource is the whole thing.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: jira-triage
  namespace: agents
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"
        autoscaling.knative.dev/max-scale: "10"
        autoscaling.knative.dev/target: "5"
    spec:
      timeoutSeconds: 90
      containerConcurrency: 1
      containers:
        - image: registry.internal/agents/jira-triage:0.1.0
          ports:
            - containerPort: 8080
          envFrom:
            - secretRef:
                name: jira-triage-secrets
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 1Gi

The annotations are doing real work:

min-scale: 0 lets the agent drop to zero replicas when nothing is calling it. That is the entire point of running this on Knative instead of a plain Deployment.
max-scale: 10 caps blast radius. If a JIRA automation rule misfires and floods the webhook, it won’t take down the cluster.
target: 5 plus containerConcurrency: 1 tell Knative each pod handles one in-flight request at a time, and to scale up once five concurrent requests are queued.

timeoutSeconds: 90 is the upper bound for a single request. Most agent runs finish in 10 to 30 seconds. Ninety gives a comfortable cushion for the slow case where the model is grinding through a long classification context. If you regularly need longer than that, you’ve left “webhook handler” territory and want a workflow runner like Argo or a queue.

containerConcurrency: 1 is the safe default. Flue agents tend to hold an open session and a model connection per request. Sharing one Node process across multiple in-flight model calls works, but the concurrency math gets complicated and you trade simplicity for a small CPU win. Start at 1, tune later if cost matters.

Secrets

Secrets stay out of the Service manifest. Drop them in a Secret and envFrom them in:

apiVersion: v1
kind: Secret
metadata:
  name: jira-triage-secrets
  namespace: agents
type: Opaque
stringData:
  ANTHROPIC_API_KEY: "sk-ant-..."
  JIRA_HOST: "yourorg.atlassian.net"
  JIRA_EMAIL: "triage-bot@yourorg.com"
  JIRA_TOKEN: "..."

If you’re running External Secrets Operator or Vault Secrets Operator, point at the upstream store instead of stuffing keys into a YAML you’ll commit by accident. The Flue agent doesn’t care where the env vars come from. It just reads them.

The webhook URL

Knative gives the Service a public URL on whatever domain your cluster operator has configured. You’ll see it in kubectl get ksvc -n agents:

kubectl get ksvc jira-triage -n agents

NAME          URL                                            READY
jira-triage   https://jira-triage.agents.example.internal    True

Point your JIRA automation rule at that URL with whatever auth you want on it. Knative supports request authentication through the standard ingress controllers, so if you want HMAC verification, JWT, or a static bearer token, the Service is just an HTTP server and you wire it in upstream.

For HMAC (JIRA can sign webhooks), I add a small middleware layer in the Flue handler itself rather than at the ingress. That way the Service stays portable and the auth lives next to the agent code.

Scaling, the operator’s view

Knative does its own pod scaling, but the autoscaler still has to live somewhere. The defaults are reasonable, and if your team is already running KPA (the Knative Pod Autoscaler), you don’t need to do anything. If you’d rather use HPA, swap the annotation:

autoscaling.knative.dev/class: "hpa.autoscaling.knative.dev"

KPA reacts faster to bursts (sub-second). HPA is steadier and integrates with your existing custom metrics pipeline. For agent workloads I’d default to KPA: bursts of webhook traffic show up first as concurrency, which KPA scales on natively.

What you give up

Cold starts. The first request after a quiet period will pay 1 to 3 seconds for the Node process to come up. That’s fine for JIRA webhooks. It would be painful for a chatbot facing humans. If you can’t tolerate cold starts, set min-scale: 1 and pay for one always-warm pod.

Operational ownership. Cloudflare hands you “the function ran, here’s the log, here’s the trace.” With Knative you own the autoscaler tuning, the ingress, the cert rotation, the registry, and the K8s upgrade cycle. If your team already runs that, the marginal cost is small. If not, you might be reaching for the wrong tool.

Egress quotas. Knative defaults to no egress filtering. If you’re in a regulated environment, you’ll want NetworkPolicies in front of the agent’s namespace so it can only reach the model API and your JIRA host. The default is “open egress,” which is probably fine for the prototype and definitely not fine for prod.

Observability

The agent already returns structured JSON for each invocation (the action it took, the reasoning summary, the issue key). On Knative, every request shows up in kubectl logs and the queue-proxy sidecar exposes Prometheus metrics:

revision_request_count per Service revision
request_latency histogram
concurrent_requests gauge

Pair that with a ServiceMonitor and you get a dashboard for free. Add a structured logger inside the agent (pino, console.log of structured objects, whatever your cluster’s log pipeline can parse) and you’ve got correlated logs and metrics without any custom infra.

For traces: enable Knative’s OpenTelemetry support, or bolt on the OpenTelemetry SDK inside the Flue handler. Whichever your tracing stack already prefers.

When this is the right tool

Self-host the Flue agent on Knative when:

Your data plane has to stay in your VPC.
You already run Kubernetes and have an SRE team that knows it.
You have multiple agents and want a uniform deployment story across them all.
Webhook traffic is bursty enough that “always running on a VM” is overkill, but consistent enough that you’d rather not pay per-request to a third party.

Reach for managed serverless when:

You don’t have a K8s team, or you have one and they’re already at capacity.
You want one less thing to babysit.
The data is fine to leave the building.

Both shapes work. Both are real. The Flue agent code doesn’t change between them, which is the part I keep finding satisfying: you can develop on Workers, prove the agent out in the small, and self-host the same code in your cluster the day someone in security says “no, that data stays put.”