Full notes: GKE Compute Classes - Detailed Explanation β†’

Key Concepts

What Are Compute Classes?

Compute Classes are a GKE Custom Resource (CRD) that define priority-ordered sets of node pools with autoscaling settings. A ComputeClass spec declares priority-based node pool selection, fallback behavior when preferred resources are unavailable, and active migration to automatically move workloads to higher-priority nodes when capacity returns. When a pod selects a ComputeClass via nodeSelector with cloud.google.com/compute-class, GKE walks the priority list top-to-bottom, scheduling on the highest-priority pool available and falling back as needed.

Cluster Implementation (citadel-dev Example)

The citadel-2g-dev-tokyo-01 cluster defines three custom ComputeClasses:

ComputeClassWorkload TypePriority 1 (Cheapest)Priority 2 (Fallback)Priority 3 (Last Resort)
citadel-default-ccSharedt2d-32 Spot (2 pools)t2d-32 On-demandn2d-32 Spot
citadel-mercari-api-ccMercari APIt2d-16 Spot (2 pools)t2d-32 On-demandn2d-32 Spot
citadel-mercari-eaas-ccElasticsearcht2d-32 Spott2d-32 On-demandn2d-64 Spot

Each node pool carries a label (cloud.google.com/compute-class: <name>) and a NoSchedule taint matching the compute class name, ensuring only pods that tolerate the taint land on those nodes. The API and EaaS pools also carry additional taints like node-pool-id and machine-series for further isolation. All three classes set activeMigration.optimizeRulePriority: true to auto-migrate pods back to cheaper pools when spot capacity recovers.

How Workloads Use Compute Classes

Method 1 β€” Explicit Node Selector: Workloads set nodeSelector: cloud.google.com/compute-class: <name> and a matching NoSchedule toleration in their pod spec. Example: Elasticsearch workloads explicitly select citadel-mercari-eaas-cc.

Method 2 β€” Automatic Assignment via KubeMod: KubeMod ModRules automatically inject compute class selectors and tolerations into pods that lack them. For citadel-default-cc, four ModRules cover cases: no affinity at all, affinity but no nodeAffinity, nodeAffinity but no required rules, and only availability-type affinity. For citadel-mercari-api-cc, two ModRules match pods with nodeAffinity or nodeSelector targeting d-mercari-api.

Cost Optimization Strategy

The priority system encodes a cost ladder: Spot T2D VMs first (cheapest), On-demand T2D second (reliable fallback), Spot N2D third (different availability pool as a last resort). Active migration ensures workloads drift back to cheaper tiers automatically without manual intervention.

Architecture

                    citadel-2g-dev-tokyo-01
                              |
         +--------------------+--------------------+
         |                    |                    |
  citadel-default-cc   citadel-mercari-api-cc  citadel-mercari-eaas-cc
   (Shared Pools)        (Dedicated API)        (Elasticsearch)
         |                    |                    |
  P1: t2d Spot         P1: t2d-16 Spot      P1: t2d-32 Spot
  P2: t2d On-demand    P2: t2d-32 OD        P2: t2d-32 OD
  P3: n2d Spot         P3: n2d-32 Spot      P3: n2d-64 Spot

Quick Reference

ComponentLocation
Node Pool Definitionsmicroservices-terraform/.../terragrunt.hcl
ComputeClass CRDsmicroservices-kubernetes/.../ComputeClass/
KubeMod ModRulesmicroservices-kubernetes/.../ModRule/
ES Compute Class Configkit/pkg/elasticsearch/elasticsearch-spec.cue

Pod assignment flow:

MethodHow
ExplicitnodeSelector: cloud.google.com/compute-class: <name> + matching toleration
AutomaticKubeMod ModRules inject selector/toleration for pods without affinity rules

Key Takeaways

  • Compute Classes are a cost optimization tool β€” they encode a preference order from cheapest to most reliable node pools
  • Each node pool gets a label and a NoSchedule taint matching the compute class name, ensuring pods only land on intended pools
  • Active migration means automatic cost savings without manual intervention when spot capacity fluctuates
  • KubeMod handles the β€œdefault” case so teams don’t need to add compute class config to every workload
  • Dedicated pools (API, EaaS) carry extra taints for workload isolation beyond just the compute class
  • The microservices-ci repo has no compute class configs β€” CI/CD pipelines apply the Terraform and K8s manifests that contain the definitions