Kubernetes for Cloud Engineers: Automating TLS with cert-manager

This is part two of a series for new operations and cloud engineers getting into Kubernetes. In part one we covered how to set default TLS certificates on your ingress controllers. That works, but it means you’re managing certs by hand. Let’s fix that.

Why cert-manager?

In the last post, we created TLS secrets manually. That’s fine for getting started. It’s not fine for production. Certificates expire. People forget to renew them. And then you’re getting paged at 3am because your site is serving a browser warning.

cert-manager is a Kubernetes-native certificate management controller. It watches your cluster for resources that need certificates, talks to a Certificate Authority (usually Let’s Encrypt), handles the challenge/response dance, stores the resulting cert as a Kubernetes secret, and renews it before it expires. All automatically. You configure it once and move on with your life.

It’s one of those tools you forget is there once it’s installed. That’s the highest compliment I can give infrastructure software.

Installing cert-manager

Two options. Pick whichever matches your deployment style.

Option 1: Helm (recommended)

helm install cert-manager oci://quay.io/jetstack/charts/cert-manager \
  --version v1.20.0 \
  --namespace cert-manager \
  --create-namespace \
  --set crds.enabled=true

# NAME: cert-manager
# NAMESPACE: cert-manager
# STATUS: deployed

Option 2: Static manifests

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.20.0/cert-manager.yaml

Both methods install the same thing: the cert-manager controller, webhook, cainjector, and six CRDs. Those CRDs are the building blocks you’ll use for everything else.

# Verify everything is running
kubectl get pods -n cert-manager

# NAME                                       READY  STATUS
# cert-manager-5c4b5f7b9-xk2lq              1/1    Running
# cert-manager-cainjector-7f694c-m8p         1/1    Running
# cert-manager-webhook-7cd8c8-9tn2f          1/1    Running

Three pods running. That’s what you want to see.
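If you also want to confirm the CRD count, a small sketch like this can help. The function name count_cert_manager_crds is mine, and it assumes every cert-manager CRD name ends in cert-manager.io. If the install matches the description above, it should report six.

```shell
# Hypothetical helper: count CRDs whose name ends in cert-manager.io
count_cert_manager_crds() {
  kubectl get crds -o name | grep -c 'cert-manager\.io$'
}

# Usage (needs cluster access):
#   count_cert_manager_crds
```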

Issuers: telling cert-manager where to get certificates

cert-manager doesn’t know anything about Let’s Encrypt out of the box. You need to tell it where to go and how to prove you own your domains. That’s what Issuers do.

There are two flavors:

Issuer is namespace-scoped. It can only issue certs for resources in the same namespace.
ClusterIssuer is cluster-scoped. It works across all namespaces.

For most setups, you want a ClusterIssuer. One config, whole cluster. If you have teams that need isolated certificate management per namespace, use an Issuer. But start simple.

Staging first. Always.

Let’s Encrypt has two environments: staging and production. They work identically, but staging issues certificates signed by a fake CA that browsers don’t trust. The certificates themselves are structurally real. They just won’t show a green lock.

Why does this matter? Because Let’s Encrypt production has rate limits. Tight ones. 50 certificates per registered domain per week. 5 duplicate certificates (the exact same set of hostnames) per week. If you mess up your config and keep retrying, you can lock yourself out for days.

Staging has much more generous limits. So you should always get your pipeline working against staging first, then switch to production once you know everything is wired up correctly.

Create a staging ClusterIssuer

# staging-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: you@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
    - http01:
        ingress:
          ingressClassName: nginx

kubectl apply -f staging-issuer.yaml
# clusterissuer.cert-manager.io/letsencrypt-staging created

# Check that it registered successfully
kubectl get clusterissuer letsencrypt-staging

# NAME                  READY  AGE
# letsencrypt-staging   True   30s

READY: True means cert-manager registered an account with Let’s Encrypt staging. If you see False, run kubectl describe clusterissuer letsencrypt-staging and read the events.

Create the production ClusterIssuer

Same thing, different ACME server URL:

# production-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
    - http01:
        ingress:
          ingressClassName: nginx

Use separate privateKeySecretRef names for staging and production. The ACME accounts are completely independent. You can’t reuse keys between environments.
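For the per-team case mentioned earlier, a namespace-scoped Issuer looks almost identical to the ClusterIssuer. Here is a rough sketch that prints one for a given namespace. The helper name render_namespaced_issuer and the team-a namespace are placeholders of mine, not anything cert-manager ships.

```shell
# Hypothetical helper: print a namespace-scoped staging Issuer.
# The only differences from the ClusterIssuer are the kind and the namespace.
render_namespaced_issuer() {
  ns="$1"
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: letsencrypt-staging
  namespace: ${ns}
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: you@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
    - http01:
        ingress:
          ingressClassName: nginx
EOF
}

# Usage (needs cluster access):
#   render_namespaced_issuer team-a | kubectl apply -f -
```

With an Issuer, the account key secret lives in the Issuer’s own namespace rather than in the cert-manager namespace.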

HTTP-01 vs DNS-01: how you prove domain ownership

When you request a certificate, Let’s Encrypt needs proof that you actually own the domain. There are two ways to do this, and which one you pick depends on your setup.

HTTP-01

Let’s Encrypt sends an HTTP request to http://yourdomain.com/.well-known/acme-challenge/<token>. cert-manager spins up a temporary pod and Ingress (or HTTPRoute) to serve the response. Once validated, the temp resources get cleaned up.

This is the simpler option. It works if:

Port 80 is publicly reachable on your cluster
You don’t need wildcard certificates

It does not work if:

You’re behind a firewall with no public HTTP access
You need *.example.com certs

DNS-01

Instead of an HTTP request, Let’s Encrypt checks for a specific TXT record at _acme-challenge.yourdomain.com. cert-manager talks to your DNS provider’s API to create and clean up the record.

This is the only way to get wildcard certificates from Let’s Encrypt. It’s also the right choice for internal services that aren’t reachable from the internet, as long as the domain’s DNS zone is publicly resolvable. The tradeoff is more setup. You need to give cert-manager API credentials for your DNS provider, and DNS propagation delays can slow things down.

Built-in DNS providers: Cloudflare, Route53, Google Cloud DNS, Azure DNS, DigitalOcean, Akamai, ACMEDNS, and RFC-2136. There are webhook extensions for 30+ more.

Wiring it up with Ingress

If you’re using the traditional Ingress API, cert-manager makes this really clean. Add an annotation to your Ingress resource, include a tls block, and cert-manager handles the rest.

# my-app-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-staging
spec:
  ingressClassName: nginx
  rules:
  - host: app.example.com
    http:
      paths:
      - pathType: Prefix
        path: /
        backend:
          service:
            name: my-app
            port:
              number: 80
  tls:
  - hosts:
    - app.example.com
    secretName: app-example-com-tls

That’s it. cert-manager sees the annotation, reads the tls block, creates a Certificate resource, kicks off the ACME challenge, and stores the signed certificate in the app-example-com-tls secret. Your ingress controller picks up the secret and starts serving HTTPS.

By default, cert-manager renews a certificate once two-thirds of its lifetime has passed. For a 90-day Let’s Encrypt cert, that means renewal starts 30 days before expiration.

Staging first, remember. Point the annotation at letsencrypt-staging until you see a valid (but untrusted) cert in your browser. Then swap it to letsencrypt-prod. You’ll need to delete the old secret so cert-manager issues a fresh one from production.
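If you’d rather check from a terminal than a browser, something like the sketch below can tell you which environment signed the cert being served. Both function names are mine. The "(STAGING)" check is an assumption based on how Let’s Encrypt currently names its staging CAs, so treat it as a convenience and not a guarantee.

```shell
# Hypothetical helper: print the issuer and expiry of the cert a host serves.
show_served_cert() {
  host="$1"
  openssl s_client -connect "${host}:443" -servername "${host}" </dev/null 2>/dev/null |
    openssl x509 -noout -issuer -enddate
}

# Hypothetical helper: succeed if an issuer line looks like a staging CA.
# Assumption: staging issuer names contain the literal text "(STAGING)".
is_staging_issuer() {
  case "$1" in
    *"(STAGING)"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (needs network access to the host):
#   show_served_cert app.example.com
```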

Use cert-manager.io/cluster-issuer for ClusterIssuers and cert-manager.io/issuer for namespace-scoped Issuers. Mix them up and nothing will happen. No error, no cert. Just silence. Ask me how I know.

Wiring it up with Gateway API

If you followed the first post in this series, you know Gateway API is where things are heading. cert-manager supports it, but it needs a little extra configuration.

Enable Gateway API support

Gateway API support is not on by default. You need to opt in.

helm upgrade --install cert-manager oci://quay.io/jetstack/charts/cert-manager \
  --version v1.20.0 \
  --namespace cert-manager \
  --set crds.enabled=true \
  --set config.enableGatewayAPI=true

Make sure the Gateway API CRDs are installed in your cluster too:

kubectl apply --server-side \
  -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.1/standard-install.yaml

If you installed the CRDs after cert-manager was already running, restart it so it picks them up:

kubectl rollout restart deployment cert-manager -n cert-manager

Annotate your Gateway

The pattern is similar to Ingress. Add the annotation, reference a secret in the listener’s certificateRefs, and cert-manager fills in the rest.

# gateway.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-gateway
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  gatewayClassName: nginx
  listeners:
  - name: https
    hostname: app.example.com
    port: 443
    protocol: HTTPS
    tls:
      mode: Terminate
      certificateRefs:
      - name: app-example-com-tls

cert-manager sees the annotation and the listener config, creates a Certificate for app.example.com, and stores the result in the app-example-com-tls secret. The Gateway picks it up and starts terminating TLS.

Three things must be true for cert-manager to act on a Gateway listener:

The hostname is not empty
The TLS mode is Terminate (not Passthrough)
There’s at least one entry in certificateRefs

Miss any of those and cert-manager quietly ignores the listener.
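Since the failure is silent, it can help to spell the rules out as a checklist. The sketch below only mirrors the three conditions above; it is an illustration of mine, not cert-manager’s own logic, and the function name is made up.

```shell
# Illustration only: mirrors the three rules above.
# Arguments: hostname, TLS mode, number of certificateRefs entries.
listener_gets_certificate() {
  hostname="$1"; mode="$2"; ref_count="$3"
  [ -n "$hostname" ] || { echo "skipped: empty hostname"; return 1; }
  [ "$mode" = "Terminate" ] || { echo "skipped: TLS mode is $mode"; return 1; }
  [ "$ref_count" -ge 1 ] || { echo "skipped: no certificateRefs"; return 1; }
  echo "eligible"
}

listener_gets_certificate "app.example.com" "Terminate" 1
listener_gets_certificate "" "Terminate" 1 || true
listener_gets_certificate "app.example.com" "Passthrough" 1 || true
```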

Wildcard certificates with DNS-01

This is where DNS-01 earns its keep. Let’s say you want a single cert that covers *.example.com so every subdomain is automatically secured. HTTP-01 can’t do this. Only DNS-01 can.

I’ll use Cloudflare as the DNS provider since it’s common (and because I have a love/hate relationship with them that we’ve already discussed on the software page).

Step 1: Create a Cloudflare API token

In the Cloudflare dashboard, create an API token with these permissions:

Zone / DNS / Edit
Zone / Zone / Read

Scope it to the zone you need. Don’t use a global API key if you can avoid it.

Step 2: Store the token in your cluster

# The secret MUST be in the cert-manager namespace for ClusterIssuers
apiVersion: v1
kind: Secret
metadata:
  name: cloudflare-api-token
  namespace: cert-manager
type: Opaque
stringData:
  api-token: your-cloudflare-api-token-here

That namespace bit is important. When you use a ClusterIssuer, cert-manager looks for referenced secrets in the namespace where cert-manager itself is installed. Not the namespace of your Certificate. Not the namespace of your app. The cert-manager namespace. This trips up everyone at least once.
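If you’d rather not write the token into a manifest file, one option is to create the same secret straight from an environment variable. This is a sketch; the variable name CLOUDFLARE_API_TOKEN and the function name are my own choices.

```shell
# Hypothetical helper: create the token secret in the cert-manager namespace
# from an environment variable, so the token never lands in a file.
create_cloudflare_secret() {
  : "${CLOUDFLARE_API_TOKEN:?set CLOUDFLARE_API_TOKEN first}"
  kubectl create secret generic cloudflare-api-token \
    --namespace cert-manager \
    --from-literal=api-token="${CLOUDFLARE_API_TOKEN}"
}

# Usage (needs cluster access):
#   CLOUDFLARE_API_TOKEN=... create_cloudflare_secret
```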

Step 3: Create a DNS-01 ClusterIssuer

# dns-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod-dns
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-dns-account-key
    solvers:
    - dns01:
        cloudflare:
          apiTokenSecretRef:
            name: cloudflare-api-token
            key: api-token

Step 4: Request the wildcard cert

# wildcard-cert.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com
  namespace: default
spec:
  secretName: wildcard-example-com-tls
  issuerRef:
    name: letsencrypt-prod-dns
    kind: ClusterIssuer
  dnsNames:
  - "*.example.com"
  - example.com

Include both *.example.com and example.com in the dnsNames. The wildcard only covers subdomains, not the bare domain itself.
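To make the coverage rule concrete, here is a small sketch of how wildcard matching behaves. The function name_covers_host is illustrative only. It also shows a related gotcha: a wildcard matches exactly one label, so a.b.example.com is not covered either.

```shell
# Illustration: does a certificate name cover a host?
# A wildcard matches exactly one extra label on the left.
name_covers_host() {
  name="$1"; host="$2"
  case "$name" in
    '*.'*)
      base=${name#\*.}                      # strip the leading "*."
      label=${host%."$base"}                # what sits left of the base domain
      [ "$label" != "$host" ] || return 1   # host does not end in ".base"
      case "$label" in
        ''|*.*) return 1 ;;                 # empty, or more than one label
        *) return 0 ;;
      esac
      ;;
    *) [ "$name" = "$host" ] ;;
  esac
}

for host in app.example.com example.com a.b.example.com; do
  if name_covers_host '*.example.com' "$host"; then
    echo "$host: covered"
  else
    echo "$host: not covered"
  fi
done
```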

kubectl apply -f wildcard-cert.yaml
# certificate.cert-manager.io/wildcard-example-com created

# Watch it work
kubectl get certificate wildcard-example-com -w

# NAME                    READY  SECRET                        AGE
# wildcard-example-com    False  wildcard-example-com-tls      5s
# wildcard-example-com    True   wildcard-example-com-tls      47s

DNS-01 is slower than HTTP-01 because of DNS propagation. Give it a minute or two. If it takes more than five minutes, start troubleshooting (see below).
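If you’re scripting this, you can let kubectl do the waiting. The sketch below blocks until the Certificate is Ready and lists the ACME challenges if the timeout passes. The function name is mine, and the 300s default simply mirrors the five-minute rule of thumb above.

```shell
# Hypothetical helper: wait for a Certificate, or show its challenges on timeout.
# Arguments: certificate name, namespace (default: default), timeout (default: 300s).
wait_for_certificate() {
  name="$1"; ns="${2:-default}"; timeout="${3:-300s}"
  if kubectl wait --for=condition=Ready "certificate/${name}" \
       --namespace "$ns" --timeout="$timeout"; then
    echo "certificate ${name} is ready"
  else
    kubectl get challenges --namespace "$ns"
    return 1
  fi
}

# Usage (needs cluster access):
#   wait_for_certificate wildcard-example-com default 300s
```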

When things go wrong

They will. Here’s how to figure out what happened.

cert-manager has a chain of resources that it creates when processing a certificate request. Follow the chain from top to bottom:

# 1. Is the Certificate ready?
kubectl get certificates -A

# 2. What does the CertificateRequest say?
kubectl get certificaterequest -A
kubectl describe certificaterequest <name> -n <namespace>

# 3. Is the Issuer healthy?
kubectl get clusterissuer
kubectl describe clusterissuer <name>

# 4. For Let's Encrypt: check the ACME Order and Challenge
kubectl get orders -A
kubectl get challenges -A
kubectl describe challenge <name> -n <namespace>

# 5. Check the cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager --tail=100

Most problems fall into a handful of categories:

Challenge not reachable (HTTP-01). Port 80 isn’t open, or the temporary solver Ingress isn’t being picked up by your controller. Check that ingressClassName in your solver config matches your actual ingress controller.
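A quick way to test reachability from outside the cluster is to request the challenge path yourself. This is a sketch: the function name challenge_url is mine and "probe" is an arbitrary token. A 404 that comes back from your ingress controller still tells you port 80 is reachable; a timeout or connection refused tells you it isn’t.

```shell
# Hypothetical helper: build the URL Let's Encrypt requests during HTTP-01.
challenge_url() {
  printf 'http://%s/.well-known/acme-challenge/%s\n' "$1" "$2"
}

challenge_url app.example.com probe

# Usage (needs network access, run it from outside your network):
#   curl -sS -o /dev/null -w '%{http_code}\n' --max-time 10 \
#     "$(challenge_url app.example.com probe)"
```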

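DNS propagation timeout (DNS-01). The TXT record was created but isn’t visible yet. cert-manager checks that the record has propagated before it asks Let’s Encrypt to validate, and it uses the cluster’s DNS resolver for that check. If that resolver is slow or serves a different view of your zone, you can point cert-manager at public resolvers: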

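# --reuse-values keeps the settings from earlier installs (CRDs, Gateway API)
helm upgrade cert-manager oci://quay.io/jetstack/charts/cert-manager \
  --version v1.20.0 \
  --namespace cert-manager \
  --reuse-values \
  --set 'extraArgs={--dns01-recursive-nameservers-only,--dns01-recursive-nameservers=1.1.1.1:53\,9.9.9.9:53}'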
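You can also ask a public resolver directly whether the challenge record is visible. A minimal sketch, assuming dig is installed; the function name challenge_txt is mine. Wildcard names are validated on the base domain, so the helper strips a leading "*." first.

```shell
# Hypothetical helper: query a public resolver for the challenge TXT record.
# Arguments: domain (may start with "*."), resolver IP (default: 1.1.1.1).
challenge_txt() {
  domain=${1#\*.}
  resolver="${2:-1.1.1.1}"
  dig +short TXT "_acme-challenge.${domain}" "@${resolver}"
}

# Usage (needs network access):
#   challenge_txt '*.example.com'
#   challenge_txt example.com 9.9.9.9
```

An empty result while a Challenge is pending suggests the record hasn’t reached that resolver yet, or was never created.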

Secret in the wrong namespace. The number one DNS-01 headache. ClusterIssuers look for API token secrets in the cert-manager namespace. Not your app namespace. Not default.

Rate limited. You hit Let’s Encrypt production limits. Switch back to staging, wait for the rate limit window to pass (usually 7 days), fix whatever caused the excessive requests, and try again.

Issuer not ready. kubectl describe clusterissuer will tell you why. Usually it’s a bad email address, an unreachable ACME server, or a malformed privateKeySecretRef.

Nothing happens at all. Check the annotation name. cert-manager.io/cluster-issuer is not the same as cert-manager.io/issuer. Using the wrong one for your Issuer type produces zero errors and zero certificates. It’s the most frustrating failure mode because there’s nothing in the logs.
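When you suspect an annotation mix-up, it helps to print exactly what the Ingress carries and compare it with the issuers that exist. A sketch, with a function name of my own invention:

```shell
# Hypothetical helper: print both cert-manager issuer annotations on an Ingress.
# Arguments: ingress name, namespace (default: default).
issuer_annotations() {
  name="$1"; ns="${2:-default}"
  printf 'cluster-issuer: %s\n' "$(kubectl get ingress "$name" -n "$ns" \
    -o jsonpath='{.metadata.annotations.cert-manager\.io/cluster-issuer}')"
  printf 'issuer: %s\n' "$(kubectl get ingress "$name" -n "$ns" \
    -o jsonpath='{.metadata.annotations.cert-manager\.io/issuer}')"
}

# Usage (needs cluster access):
#   issuer_annotations my-app
#   kubectl get clusterissuers
#   kubectl get issuers -A
```

The name next to cluster-issuer should show up under clusterissuers, and the name next to issuer should show up under issuers in the same namespace as the Ingress.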

A note on certificate lifetimes

Let’s Encrypt is shortening certificate lifetimes. They started at 90 days, and the announced plan steps down to 64 days and then to 45. This doesn’t matter much if you’re using cert-manager because it handles renewal automatically (by default it renews when a third of the lifetime remains, which is 30 days before expiration on a 90-day cert). But it does mean that if cert-manager breaks and you don’t notice, you have less runway before things go red.
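To put numbers on that runway, here is a rough sketch of the arithmetic. It assumes renewBefore is left unset, so cert-manager falls back to its default of one third of the certificate’s lifetime (which works out to 30 days on a 90-day cert). The function name renew_before_hours is mine.

```shell
# Sketch: how long before expiry does renewal start when renewBefore is unset?
# Assumption: the default is one third of the certificate lifetime.
renew_before_hours() {
  lifetime_hours=$(( $1 * 24 ))
  echo $(( lifetime_hours / 3 ))
}

for days in 90 64 45; do
  echo "${days}-day cert: renewal starts $(renew_before_hours "$days")h before expiry"
done
```

If renewal is failing, that window is roughly how long you have to notice before the cert actually expires.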

Monitor your certificates. At minimum, set up an alert on cert-manager pod health. Ideally, also alert on certificates that haven’t renewed within their expected window.
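Even without a full alerting setup, you can list what cert-manager itself records about each certificate. A minimal sketch, with a function name of my own choosing:

```shell
# Hypothetical helper: list every Certificate with its expiry and planned renewal time.
list_certificate_expiry() {
  kubectl get certificates --all-namespaces \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,NOT_AFTER:.status.notAfter,RENEWAL:.status.renewalTime'
}

# Usage (needs cluster access):
#   list_certificate_expiry
```

A renewal time that is already in the past on a certificate that hasn’t changed is a good hint that something upstream is stuck.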

Wrapping up

Here’s the workflow once everything is configured:

Deploy an Ingress or Gateway with the cert-manager annotation
cert-manager creates a Certificate resource
cert-manager talks to Let’s Encrypt, completes the challenge
The signed certificate lands in a Kubernetes secret
Your ingress controller or gateway picks it up
cert-manager renews it before expiration

No cron jobs. No manual certbot renew. No calendar reminders. Just certificates that work.

If you’re running through this series from the beginning, you now have default TLS certs on your ingress controllers and automated certificate management with Let’s Encrypt. That’s a solid foundation.

Next up in this series: DNS automation with external-dns. Because if cert-manager removes the manual work from certificates, external-dns does the same thing for DNS records.