Part 1: The Core Architecture (The “What”)
Q: What exactly is v1beta1.external.metrics.k8s.io?
A: It is a Kubernetes APIService.
In a standard Kubernetes cluster, the API Server (kube-apiserver) handles requests for native resources like Pods, Services, and Deployments. However, Kubernetes is designed to be extensible. The APIService resource tells the main API Server:
“I do not know how to handle requests for
external.metrics.k8s.io. Instead of rejecting them, please proxy (forward) any request matching this path to a specific backend service running inside the cluster.”
In your case, that backend service is the Datadog Cluster Agent.
Q: Why was the status “Healthy” even though my HPA was failing?
A: The status True / Available in the APIService output merely confirms network connectivity.
The Kubernetes Aggregation Layer periodically pings the registered backend (Datadog Cluster Agent Service on port 8443).
- The Check: “Can I establish a TCP connection and get a 200 OK from the
/or/healthzendpoint?” - The Result: “Yes.”
- The Misconception: This status does not check if the agent can actually fetch data, if your API keys are valid, or if your metric definitions are correct. It only confirms the “pipe” is open, not that the water is clean.
Part 2: The Workflow (The “How”)
Q: What happens step-by-step when an HPA requests a metric?
A: The flow is complex and involves multiple hops:
- HPA Controller Loop: The Horizontal Pod Autoscaler (HPA) controller (part of
kube-controller-manager) wakes up (default every 15-30s). - API Request: The HPA sees it needs a metric named
proxy-dynamic-min-replicas. It sends a GET request to the main API Server:GET /apis/external.metrics.k8s.io/v1beta1/namespaces/kouzoh-scenario-test-jp-dev/proxy-dynamic-min-replicas - Aggregation Layer Proxy: The main API Server looks up its APIService table, sees that
external.metrics.k8s.iobelongs to Datadog, and forwards the request to thedatadog-cluster-agentPod on port 8443. - Agent Processing: The Datadog Cluster Agent receives the request. It must now translate this Kubernetes metric name into a Datadog Query (e.g.,
avg:nginx.requests{...}). - Datadog Backend Fetch: The Agent uses its API key to query Datadog HQ (SaaS).
- Response: The value (e.g.,
0.5) is returned to the Agent, then to the API Server, and finally to the HPA Controller.
Part 3: The Problem (The “Why”)
Q: What caused the “DatadogMetric not found” error?
A: This was a State Collision issue caused by the dcaautogen feature.
1. The “Autogen” Feature:
When you create an HPA referencing an external metric before you create a specific DatadogMetric CRD, the Cluster Agent tries to be helpful. It automatically generates a DatadogMetric object internally to handle the request. It names these with a specific pattern: dcaautogen-<hash>.
2. The Cache: The Cluster Agent maintains an in-memory map:
- Key:
external_metric_name(e.g.,proxy-dynamic-min-replicas) - Value:
internal_object_id(e.g.,dcaautogen-54b5...)
3. The Collision:
- At some point in the past, an HPA requested this metric. The Agent generated an
autogenmapping for it. - Later, you deployed a valid, manual
DatadogMetricobject (the one we saw withkubectl get datadogmetrics). - The Bug: The Agent’s cache was “sticky.” It continued to map the metric name
proxy-dynamic-min-replicasto the olddcaautogenID instead of switching to your new, manualDatadogMetric. - Since the old
dcaautogenobject likely didn’t exist anymore (or was invalid), the lookup failed with “DatadogMetric not found,” even though the manual one was right there.
Q: Why did kubectl get datadogmetrics show “Active: True”?
A: Because that command only queries the Kubernetes etcd database.
- The
DatadogMetricCustom Resource was valid. - The Operator/Controller was successfully updating its status.
- However, the Metrics Server component (the part that answers the HPA’s questions) was looking at a different (stale) reference pointer.
Part 4: The Solution (The “Fix”)
Q: Why does restarting the Pod fix it?
A: Restarting the datadog-cluster-agent pod kills the process and clears its RAM.
When the new Pod starts:
- The memory is empty.
- It scans the cluster for all existing
DatadogMetricobjects. - It builds a fresh mapping table.
- It sees your manual
proxy-dynamic-min-replicasobject and correctly maps the name to that object ID. - The stale
dcaautogenreference is gone forever.
Q: How do I prevent this in the future?
A:
- Order of Operations: Always deploy the
DatadogMetricCRD before or at the same time as the HPA. Do not deploy an HPA that references a metric that doesn’t exist yet, or the Agent might try to “autogen” it. - Disable Autogen (Optional): If you always use manual
DatadogMetricobjects (which is best practice for production), you can disable the autogen feature entirely by setting the environment variableDD_EXTERNAL_METRICS_PROVIDER_ENABLE_DATADOGMETRIC_AUTOGENtofalsein your Cluster Agent deployment. This prevents it from ever creating these “ghost” mappings.
Part 5: Verification Command Explained
Q: What did the kubectl get --raw command do?
A: kubectl get --raw is the ultimate “truth serum” for Kubernetes APIs.
- Standard
kubectl get hpaasks the HPA controller for its status. It tells you what the HPA thinks is happening (which might be “I can’t fetch metrics”). kubectl get --raw "/apis/external..."acts like acurlrequest directly to the API endpoint. It bypasses all controllers and caches. It forces the API server to attempt a live fetch from the Datadog Agent right now.- If this command returns JSON, the pipe is working. If it returns an Error (as it did for you), the issue is definitely in the backend (Datadog Agent), not in the HPA configuration.