Kubernetes Cluster Autoscaling: Designing Safe Node Pool Autoscaling for Production

Kubernetes cluster autoscaling is more than “add a node when you run out of nodes”. In production, autoscaling directly affects cloud cost, application stability, workload startup time, quota limits, PodDisruptionBudgets, zone topology, and even node security. With a careless configuration, the cluster can scale too slowly when traffic rises, scale too aggressively and burn through the budget, or fail to scale at all because pods stay Pending on constraints that no node can satisfy.

This article takes the perspective of a sysadmin/SRE running Kubernetes on a cloud such as AWS EKS, GKE, or AKS, or on a self-managed cluster. The goal is an autoscaling design that is easy to understand, has guardrails, comes with concrete verification commands, and includes an acceptance checklist to complete before going to production.

1. Understanding the three layers of autoscaling in Kubernetes

Autoscaling in Kubernetes usually has three layers. Each layer solves a different problem, and they should not be confused with one another.

Horizontal Pod Autoscaler (HPA): increases or decreases the number of pods based on CPU, memory, or custom metrics.
Vertical Pod Autoscaler (VPA): recommends or automatically adjusts pod requests/limits.
Cluster Autoscaler/Karpenter: increases or decreases the number of nodes, or provisions new nodes when pods have nowhere to run.

In short: HPA handles the number of pods, VPA handles the size of pods, and cluster autoscaling handles infrastructure capacity.

2. A sample production scenario

Suppose you operate an API platform whose traffic rises sharply at peak hours. The cluster has three groups of workloads:

Frontend/API: must scale quickly, with low latency as the priority.
Worker queue: can scale more slowly and tolerates short interruptions.
System add-ons: CoreDNS, ingress controller, monitoring, and logging must always have stable capacity.

Production node pools should be split into at least: a system node pool for add-ons, a general node pool for ordinary applications, and a specialized node pool for workloads that need GPUs, large disks, or a specific instance type.

3. Check the current cluster state before enabling autoscaling

kubectl get nodes -o wide
kubectl top nodes
kubectl get pods -A --field-selector=status.phase=Pending
kubectl describe node <node-name> | grep -E -A5 "Allocated resources|Taints|Labels"
kubectl get pdb -A

Quick notes:

kubectl top nodes shows actual CPU/memory usage, but the autoscaler mainly looks at requests, not usage.
Pending pods signal that the cluster lacks capacity or that a pod has constraints that cannot be satisfied.
Taints and labels determine which node pools a pod can be scheduled onto.
A PDB can block scale down if evicting a pod would violate the minimum number of available replicas.

4. Requests and limits: the foundation of autoscaling

Cluster Autoscaler does not see that an application “feels slow”. It looks at Pending pods and their resource requests. If an app declares no requests, the scheduler cannot account for capacity accurately.

resources:
  requests:
    cpu: "250m"
    memory: "512Mi"
  limits:
    cpu: "1000m"
    memory: "1Gi"

Example: a node has 2 CPUs of allocatable capacity left. If each pod requests 250m CPU, the scheduler can place about 8 pods as far as CPU is concerned (2000m / 250m). If requests are set too low, the cluster becomes overcommitted and latency rises. If requests are set too high, pods go Pending early and the autoscaler adds nodes that are not needed.
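
The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this article, the memory figures for the node ("8Gi", "3Gi") are assumed values that do not come from a real cluster, and the real scheduler also accounts for DaemonSet pods, pod count limits, and other resources.

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "250m" or "2" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_to_mib(quantity: str) -> int:
    """Convert a memory quantity to MiB. Only the Mi and Gi suffixes are handled here."""
    if quantity.endswith("Gi"):
        return int(float(quantity[:-2]) * 1024)
    if quantity.endswith("Mi"):
        return int(quantity[:-2])
    raise ValueError(f"unsupported memory quantity: {quantity}")


def pods_per_node(alloc_cpu: str, alloc_mem: str, req_cpu: str, req_mem: str) -> int:
    """Return how many identical pods fit, taking the tighter of CPU and memory."""
    by_cpu = cpu_to_millicores(alloc_cpu) // cpu_to_millicores(req_cpu)
    by_mem = memory_to_mib(alloc_mem) // memory_to_mib(req_mem)
    return min(by_cpu, by_mem)


# Assumed node sizes: 2 allocatable CPUs with 8Gi or 3Gi of allocatable memory.
print(pods_per_node("2", "8Gi", "250m", "512Mi"))  # 8, CPU is the limit
print(pods_per_node("2", "3Gi", "250m", "512Mi"))  # 6, memory is the limit
```

The second call shows why both requests matter: with less memory on the node, the memory request becomes the binding constraint even though CPU alone would allow 8 pods.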

5. Configuring HPA for an API workload

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 4
  maxReplicas: 40
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 30
        periodSeconds: 60

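minReplicas keeps a baseline of capacity. maxReplicas is a safety cap that prevents unbounded scaling when the app misbehaves. behavior lets the HPA scale up quickly but scale down gradually to avoid flapping: in the manifest above, scale up may double the replica count every 60 seconds, while scale down removes at most 30% of replicas per 60 seconds after a 300-second stabilization window.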
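
To see how the 65% CPU target turns into a replica count, here is a minimal sketch of the replica formula described in the Kubernetes HPA documentation: desired = ceil(current * currentMetric / targetMetric), skipped when the ratio is within the default 10% tolerance, then clamped to minReplicas and maxReplicas. The function name is hypothetical, and the sketch deliberately ignores the behavior policies, stabilization windows, and handling of unready pods.

```python
import math


def desired_replicas(current: int, current_util: float, target_util: float = 65,
                     min_replicas: int = 4, max_replicas: int = 40,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for a single utilization metric."""
    ratio = current_util / target_util
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current
    desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 130))   # 8: utilization is twice the target
print(desired_replicas(10, 68))   # 10: inside the tolerance band
print(desired_replicas(30, 130))  # 40: capped by maxReplicas
print(desired_replicas(10, 20))   # 4: floored by minReplicas
```

Because utilization is measured against requests, a request that is set too low inflates current_util and makes the HPA scale up earlier than intended.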

6. Cluster Autoscaler: when are nodes added?

Cluster Autoscaler typically scales up when pods are Pending because of insufficient CPU/memory or because no suitable node exists. It scales down when a node has been underutilized for long enough and the pods on it can be evicted safely.

kubectl -n kube-system logs deploy/cluster-autoscaler --tail=120
kubectl get events -A --sort-by=.lastTimestamp | tail -80

Sample output (simplified):

pod production/api-7c8d9 pending: 0/6 nodes are available: insufficient cpu
scale-up: setting group general-pool size from 3 to 4

These lines show that the pod is Pending for lack of CPU, and that the autoscaler picked the general-pool node group and increased its size.

7. Designing node pools with guardrails

A production node pool should have clear limits:

min size: enough to run the baseline workload and survive the loss of one node.
max size: based on budget, cloud quota, and the capacity plan.
instance type: not so small that DaemonSets consume most of the resources, and not so large that each added node wastes capacity.
availability zone: spread across multiple zones to avoid a single point of failure.
labels/taints: route workloads to the right kind of node.

kubectl label node ip-10-0-1-21 workload=general
kubectl taint nodes ip-10-0-2-15 dedicated=system:NoSchedule

On managed cloud Kubernetes, labels and taints should usually be set in the node group configuration rather than applied by hand to individual nodes, because newly created nodes will not carry them unless they are declared up front.

8. Karpenter: more flexible node provisioning

Karpenter is popular on EKS because it can create nodes on demand faster and more flexibly than fixed node groups. Instead of only increasing the size of an existing group, Karpenter picks a suitable instance type based on the pods' constraints.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand", "spot"]
  limits:
    cpu: "200"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m

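limits.cpu is a very important guardrail. Without a limit, an HPA bug or abnormal traffic can create far too many nodes. Note that the manifest above is abbreviated: a complete NodePool on EKS also needs a nodeClassRef under spec.template.spec that points to an EC2NodeClass.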
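
One rough way to reason about this guardrail is to translate the CPU limit into a worst-case node count and daily cost. The sketch below does that arithmetic under loud assumptions: the node size (8 vCPUs) and the hourly price (0.40) are invented for illustration, the function name is hypothetical, and the limit is treated as a hard ceiling, which only approximates how Karpenter enforces it.

```python
def worst_case_budget(limit_cpu: int, vcpu_per_node: int, hourly_price: float):
    """Return (max node count, worst-case cost per day) for a NodePool CPU limit."""
    max_nodes = limit_cpu // vcpu_per_node
    daily_cost = round(max_nodes * hourly_price * 24, 2)
    return max_nodes, daily_cost


# limits.cpu "200" from the manifest; node size and price are assumed values.
nodes, cost = worst_case_budget(limit_cpu=200, vcpu_per_node=8, hourly_price=0.40)
print(nodes, cost)  # 25 nodes, 240.0 per day
```

If the worst-case figure is not something the budget can absorb, the limit is too high for that NodePool.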

9. Spot instances: cheaper, but you must tolerate failure

Spot fits workers, batch jobs, queue consumers, and other workloads with good retry behavior. For critical APIs you can use a mix of on-demand and spot, but you need a PodDisruptionBudget, topology spread, and graceful shutdown.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: api

A PDB prevents scale down or node disruption from taking out too many pods at once. However, a PDB that is too strict can keep the autoscaler from scaling down at all.
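
The interaction between minAvailable and the replica count can be made concrete with the quantity a PDB reports as disruptionsAllowed: the number of healthy pods minus minAvailable, never below zero. The helper name below is hypothetical; the numbers reuse minAvailable: 3 from the PDB and minReplicas: 4 from the HPA in this article.

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Number of pods that may be evicted voluntarily without violating the PDB."""
    return max(0, healthy_pods - min_available)


print(disruptions_allowed(4, 3))   # 1: at the HPA minimum, one pod can move at a time
print(disruptions_allowed(3, 3))   # 0: any eviction is refused, so the node cannot be drained
print(disruptions_allowed(40, 3))  # 37: at the HPA maximum the PDB is very loose
```

A PDB whose minAvailable equals the replica count always yields 0, which is the "too strict" case that blocks scale down.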

10. Topology spread: avoid piling pods into one zone

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: api

With DoNotSchedule, pods can go Pending when a zone lacks capacity, which in turn triggers a scale up. With ScheduleAnyway, the scheduler is more flexible but the distribution may be uneven. Choose according to how strict your SLA is.
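
The effect of maxSkew: 1 is easier to see with numbers. The sketch below applies the skew definition from the Kubernetes documentation (matching pods in a zone after placement, minus the minimum across zones) to decide which zones may receive the next pod under DoNotSchedule. The function name and zone names are hypothetical, and node capacity, taints, and affinity are ignored.

```python
def zones_allowed(pods_per_zone: dict, max_skew: int = 1) -> list:
    """Return the zones where one more matching pod keeps the skew within max_skew."""
    global_min = min(pods_per_zone.values())
    return sorted(
        zone for zone, count in pods_per_zone.items()
        if (count + 1) - global_min <= max_skew
    )


# Assumed current distribution of app=api pods across three zones.
print(zones_allowed({"zone-a": 2, "zone-b": 2, "zone-c": 1}))  # ['zone-c']
print(zones_allowed({"zone-a": 1, "zone-b": 1, "zone-c": 1}))  # all three zones
```

In the first case only zone-c is acceptable. If zone-c has no free capacity, a DoNotSchedule pod stays Pending until a node appears there, while a ScheduleAnyway pod would be placed in another zone.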

11. Troubleshooting Pending pods that do not trigger scale up

kubectl describe pod -n production <pending-pod>
kubectl get events -n production --sort-by=.lastTimestamp | tail -50
kubectl -n kube-system logs deploy/cluster-autoscaler --tail=200
kubectl get storageclass
kubectl get quota -A

Common causes:

The pod's CPU/memory request is larger than any allowed instance type.
A node selector or affinity rule points to a label that does not exist.
The node pool is tainted but the pod lacks a matching toleration.
Cloud quota is exhausted: vCPU, IP addresses, subnet, disk, or instance family.
A persistent volume can only be attached in one zone.
A PDB or safe-to-evict annotation blocks scale down (this affects removing nodes rather than adding them; see section 12).
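
The taint and toleration cause in the list above is a frequent one, so a small sketch of the matching rules may help when reading kubectl describe output. It follows the rules in the Kubernetes documentation (operator Equal compares key and value, Exists ignores the value, an empty effect matches every effect, an empty key with Exists matches every taint). The function names and the plain-dict representation are assumptions made for this example.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    if not key:
        # An empty key is only meaningful with Exists and then matches any taint.
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def can_schedule(tolerations: list, taints: list) -> bool:
    """A pod fits only if every NoSchedule/NoExecute taint is tolerated."""
    blocking = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in blocking)


# The taint from section 7: dedicated=system:NoSchedule
system_taint = {"key": "dedicated", "value": "system", "effect": "NoSchedule"}
print(can_schedule([], [system_taint]))  # False
print(can_schedule([{"key": "dedicated", "operator": "Exists"}], [system_taint]))  # True
```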

12. Troubleshooting scale down that does not happen

Scale down is harder than scale up because the autoscaler must make sure pods can be moved safely.

kubectl describe node <node-name> | grep -E -i -A20 "non-terminated|requested|allocated"
kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>
kubectl get pdb -A
kubectl -n kube-system logs deploy/cluster-autoscaler | grep -i "unremovable" | tail -50

If you see pods with local storage, kube-system pods without a suitable PDB, or the annotation cluster-autoscaler.kubernetes.io/safe-to-evict=false, the node may be marked unremovable.
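
These conditions can be turned into a quick triage helper. The sketch below is a simplified reading of the blockers listed in the Cluster Autoscaler FAQ, not the autoscaler's actual code: the pod dictionaries and their field names (daemonset, has_controller, local_storage, has_pdb) are invented for this example, and cases such as restrictive PDBs or pods that cannot fit elsewhere are left out.

```python
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def scale_down_blockers(pods: list) -> list:
    """Return (pod name, reason) pairs that would likely keep a node from being removed."""
    blockers = []
    for pod in pods:
        annotations = pod.get("annotations", {})
        if pod.get("daemonset"):
            continue  # DaemonSet pods do not block node removal
        if annotations.get(SAFE_TO_EVICT) == "false":
            blockers.append((pod["name"], "safe-to-evict=false"))
        elif annotations.get(SAFE_TO_EVICT) == "true":
            continue  # explicitly marked as evictable
        elif not pod.get("has_controller", True):
            blockers.append((pod["name"], "not backed by a controller"))
        elif pod.get("local_storage"):
            blockers.append((pod["name"], "uses local storage"))
        elif pod.get("namespace") == "kube-system" and not pod.get("has_pdb"):
            blockers.append((pod["name"], "kube-system pod without a PDB"))
    return blockers


# Hypothetical pods on one node.
pods = [
    {"name": "api-1", "namespace": "production"},
    {"name": "cache-0", "namespace": "production", "local_storage": True},
    {"name": "metrics-server-x", "namespace": "kube-system"},
    {"name": "node-exporter-y", "namespace": "kube-system", "daemonset": True},
]
for name, reason in scale_down_blockers(pods):
    print(name, "->", reason)
```

The output names cache-0 and metrics-server-x, which matches what the "unremovable" log lines would point to for such a node.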

13. What should autoscaling monitoring include?

Do not enable autoscaling and just hope it works. At a minimum, track:

Number of Pending pods per namespace.
Time from Pending to Running.
Number of nodes per node pool and per zone.
CPU/memory requests versus allocatable.
Cluster Autoscaler/Karpenter error logs.
Node cost per day and per team/workload.
Number of scale-up/scale-down events during peak hours.

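# Cluster Autoscaler serves its own metrics (port 8085 by default); the API server's /metrics endpoint does not include them
kubectl -n kube-system port-forward deploy/cluster-autoscaler 8085:8085 &
curl -s http://localhost:8085/metrics | grep cluster_autoscaler | head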
kubectl top pods -A --sort-by=cpu | head -20
kubectl get hpa -A

14. Cost control: autoscaling needs a budget

Good autoscaling means scaling at the right time, not scaling without bound. Put guardrails at several layers:

maxReplicas on the HPA.
Max size on the node group, or limits on the Karpenter NodePool.
ResourceQuota per namespace.
LimitRange so that workloads cannot request arbitrary amounts.
Alerts when node count grows abnormally or cost exceeds a threshold.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "80"
    requests.memory: 160Gi
    limits.cpu: "160"
    limits.memory: 320Gi
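
A quick sanity check is to confirm that the namespace quota and the HPA cap agree with each other. The sketch below uses the per-pod requests and limits from section 4 and the quota above, expressed in millicores and MiB, to compute how many such pods the quota admits. The function name is hypothetical, and it assumes every pod in the namespace looks like the API pod, which is rarely true in practice.

```python
def max_pods_within_quota(per_pod: dict, quota: dict) -> int:
    """Largest pod count for which every quota-tracked resource still fits."""
    return min(quota[name] // amount for name, amount in per_pod.items())


# CPU in millicores, memory in MiB.
per_pod = {
    "requests.cpu": 250, "requests.memory": 512,
    "limits.cpu": 1000, "limits.memory": 1024,
}
quota = {
    "requests.cpu": 80_000, "requests.memory": 160 * 1024,
    "limits.cpu": 160_000, "limits.memory": 320 * 1024,
}

print(max_pods_within_quota(per_pod, quota))  # 160, bounded by limits.cpu
```

With maxReplicas: 40, the API alone uses a quarter of this quota, which leaves room for workers and other deployments in the same namespace.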

15. Acceptance checklist before enabling in production

Every important deployment has reasonable CPU/memory requests.
HPA has min/max replicas and scale up/down behavior.
Node pools have min/max sizes, span multiple zones, and use suitable instance types.
System add-ons run on a stable node pool and are not starved of resources by applications.
Important workloads have PDBs, but the PDBs are not too strict.
You have tested that Pending pods actually trigger scale up.
You have tested scale down after load drops.
Alerts exist for Pending pods, autoscaler errors, and abnormal cost increases.
The runbook includes verification commands, an owner, and a rollback procedure.

16. Lab exercise

Create a test cluster with a node pool of min 1, max 3.
Deploy an nginx app with a 500m CPU request and a high replica count to create Pending pods.
Enable HPA and generate load with hey or wrk.
Watch kubectl get hpa, kubectl get pods, and kubectl get nodes every 30 seconds.
Reduce the load and measure how long scale down takes.
Write a short runbook: symptoms, verification commands, causes, and fixes.

Conclusion: Kubernetes cluster autoscaling is only safe when requests are accurate, node pools have limits, scheduler constraints are clear, and monitoring goes deep enough. Treat autoscaling as part of the production design, not as a feature to switch on at the last minute when the cluster is already overloaded.