Kubernetes Observability on the Cloud: Designing Logging, Metrics, and Tracing for Production

Kubernetes observability on the cloud is more than installing Prometheus and drawing a few CPU/RAM dashboards. In production, observability has to help the operations team answer three questions quickly: where the system is failing, how users are affected, and which recent change raised the risk.

This article takes a hands-on approach for Kubernetes running on the cloud, such as EKS, AKS, GKE, or a self-managed cluster on VMs. It covers metrics design, logging, distributed tracing, SLO-based alerting, a troubleshooting workflow, and an acceptance checklist so that sysadmins and SREs can roll things out in a controlled way.

1. How does observability differ from monitoring?

Traditional monitoring usually asks “is the server still up?”. Observability digs deeper: “why is the checkout request slow for one group of users?”, “did the pod restart because of an OOM kill or a dependency timeout?”, “how did the new release affect p95 latency?”.

The three foundational signals:

Metrics: time-series measurements such as request rate, error rate, latency, CPU, memory, and queue depth.
Logs: events with context, used to investigate a specific failure in detail.
Traces: the path of a request across multiple services, which is essential for microservices.

2. Reference architecture on the cloud

A compact architecture that is still sufficient for production includes:

Prometheus or a managed Prometheus service to scrape metrics.
Grafana for dashboards and alert visualization.
Loki/Elasticsearch/OpenSearch or the cloud provider's native logging service to store logs.
OpenTelemetry Collector to receive metrics/logs/traces and route them to backends.
Tempo/Jaeger or a managed tracing service.
Alertmanager or an incident system such as PagerDuty/Opsgenie/an internal Slack webhook.

Key principle: workloads should not send telemetry directly to many different backends. Use the OpenTelemetry Collector as a buffer layer to normalize data, filter out sensitive fields, and swap backends when needed without touching the applications.

3. Installing kube-prometheus-stack with Helm

In a lab, the fastest way to get Prometheus, Grafana, Alertmanager, and the default rules is the kube-prometheus-stack Helm chart:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set grafana.adminPassword='change-me-in-secret-manager'

Check the pods and services:

kubectl -n monitoring get pods
kubectl -n monitoring get svc

Expected output (abridged; pod name suffixes will differ):

NAME                                                        READY   STATUS
kube-prometheus-stack-grafana-xxxxx                        3/3     Running
prometheus-kube-prometheus-stack-prometheus-0              2/2     Running
alertmanager-kube-prometheus-stack-alertmanager-0          2/2     Running

Do not expose Grafana through an unprotected public LoadBalancer. Use a VPN, a private ingress, SSO/OIDC, or at the very least an IP allowlist.

4. Metrics to track for production Kubernetes

4.1. Cluster and nodes

# Node not ready
kube_node_status_condition{condition="Ready",status="true"} == 0

# Disk pressure / memory pressure
kube_node_status_condition{condition=~"DiskPressure|MemoryPressure",status="true"} == 1

4.2. Pods and containers

# Pod restarts within the last 15 minutes
increase(kube_pod_container_status_restarts_total[15m]) > 0

# Container whose last termination was OOMKilled
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1

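4.3. Application RED metrics

For HTTP services, prioritize RED: Rate, Errors, Duration. The queries below assume the application exports http_requests_total with a status label and an http_request_duration_seconds histogram, both labeled by service.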

# Request rate
sum(rate(http_requests_total[5m])) by (service)

# Error ratio
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)

# p95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

A polished dashboard without application metrics only shows that “the infrastructure is busy”. It does not show where users are hurting.
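
To make the arithmetic behind the error ratio and p95 queries concrete, here is a minimal Python sketch. It is not Prometheus code: the function names and the sample bucket counts are made up for illustration, and the quantile estimate only approximates the linear interpolation that histogram_quantile performs inside a bucket.

```python
import math


def error_ratio(total_requests: float, failed_requests: float) -> float:
    """Share of requests that failed; 0.0 when there was no traffic."""
    if total_requests <= 0:
        return 0.0
    return failed_requests / total_requests


def quantile_from_buckets(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets holds (upper_bound_seconds, cumulative_count) pairs sorted by
    upper bound, with math.inf as the last bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical sample: 100 requests, 98 of them faster than 1 second.
sample_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print("error ratio:", error_ratio(total_requests=1000, failed_requests=25))
print("estimated p95 (s):", quantile_from_buckets(0.95, sample_buckets))
```

The estimate can never be more precise than the bucket boundaries, so it helps to place a bucket boundary close to the latency threshold you intend to alert on.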

5. Logging: logs need structure and a correlation id

Production logs should be JSON, or at least have a fixed set of fields. The minimum fields:

timestamp, level, service, environment
request_id or trace_id
user_id in hashed form if needed; never log raw PII
error_code, duration_ms, status

Example of a good log line:

{"level":"error","service":"checkout-api","trace_id":"8c1f...","route":"/checkout","status":500,"duration_ms":842,"error":"payment_gateway_timeout"}
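
One way an application could emit log lines in this shape is sketched below using only Python's standard logging module. The formatter class, the fixed service and environment values, the "fields" convention for extra data, and the sample trace id are all assumptions made for this example, not something prescribed by a particular logging library.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    # Hypothetical fixed fields; in practice these would come from configuration.
    FIXED_FIELDS = {"service": "checkout-api", "environment": "prod"}

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            **self.FIXED_FIELDS,
            "message": record.getMessage(),
        }
        # Per-request context is passed through extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout-api")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Write to an in-memory buffer here; a real service would log to stdout.
buffer = io.StringIO()
logger = build_logger(buffer)
logger.error(
    "payment gateway timeout",
    extra={"fields": {
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",  # example value only
        "route": "/checkout",
        "status": 500,
        "duration_ms": 842,
        "error_code": "payment_gateway_timeout",
    }},
)
line = buffer.getvalue().strip()
print(line)
```

Writing one JSON object per line to stdout lets the node-level log agent collect and parse the output without a custom parser.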

When shipping logs with Fluent Bit or Vector, configure filters that strip secrets such as tokens, passwords, and session cookies before the data leaves the cluster.
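
The redaction logic itself can be illustrated in a few lines of Python. This is a sketch of the idea only, not Fluent Bit or Vector configuration; the key list and the bearer-token pattern are assumptions that you would adapt to your own log schema.

```python
import re

# Hypothetical list of key fragments treated as sensitive.
SENSITIVE_KEYS = ("token", "password", "cookie", "authorization", "session")
# Matches "Bearer <credential>" inside free-text values.
BEARER_PATTERN = re.compile(r"bearer\s+[a-z0-9._~+/=-]+", re.IGNORECASE)
MASK = "[REDACTED]"


def redact(value, key: str = ""):
    """Return a copy of a parsed log entry with sensitive values masked."""
    if isinstance(value, dict):
        return {k: redact(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(item, key) for item in value]
    if any(fragment in key.lower() for fragment in SENSITIVE_KEYS):
        return MASK
    if isinstance(value, str):
        return BEARER_PATTERN.sub("Bearer " + MASK, value)
    return value


entry = {
    "level": "info",
    "user": "a1b2c3",
    "password": "hunter2",
    "headers": {"Cookie": "sid=123", "Accept": "application/json"},
    "message": "upstream call sent Bearer abc.def.ghi",
}
print(redact(entry))
```

Masking by key name catches structured fields, while the pattern catches credentials that leak into free-text messages. Neither is exhaustive, so applications should still avoid logging secrets in the first place.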

6. Distributed tracing with OpenTelemetry

Tracing is especially useful when a request passes through an ingress, an API gateway, service A, a message queue, service B, and a database. Install the OpenTelemetry Collector with Helm:

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
kubectl create namespace observability
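# Recent versions of this chart require image.repository to be set explicitly;
# the otel/opentelemetry-collector-k8s image is one option.
helm upgrade --install otel-collector open-telemetry/opentelemetry-collector \
  --namespace observability \
  --set mode=deployment \
  --set image.repository=otel/opentelemetry-collector-k8s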

A basic pipeline consists of an OTLP receiver, the batch and memory_limiter processors, and an exporter to Tempo/Jaeger or a cloud backend. On the application side, enable instrumentation for the language in use: the Java agent, or the Node.js, Python, or Go SDK.

A good trace shows which span is slow and which dependency failed, and carries clear attributes such as service.name, deployment.environment, http.route, and db.system.
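
To show how a trace id travels between services and ends up in logs, the sketch below parses and re-issues a W3C Trace Context traceparent header using only the standard library. It assumes your services propagate context with that header (the default propagator in OpenTelemetry SDKs); in a real application the SDK does this for you, and the function names here are hypothetical.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Layout: version(2 hex)-trace_id(32 hex)-parent_span_id(16 hex)-flags(2 hex)
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the trace context from a traceparent header, or None if invalid."""
    match = TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero ids are invalid according to the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Build the header for an outgoing call: same trace id, new span id."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
print("trace_id for log correlation:", ctx.trace_id)
print("outgoing header:", child_traceparent(ctx))
```

Because every hop keeps the same trace_id, writing it into each log line is what makes it possible to jump from a slow span to the matching log entries.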

7. Alert on SLOs and avoid alert spam

Too many alerts teach the on-call team to ignore them. Move from alerts on individual resources to alerts on user impact:

Error rate above the threshold for 5-15 minutes.
p95/p99 latency above the SLO.
Availability below target.
Queue lag growing continuously.
Crash-looping pods of a critical service.

Example p95 latency alert expression (the 0.8 threshold is in seconds):

histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{service="checkout-api"}[5m])) by (le)
) > 0.8

Every alert should carry a runbook link, a dashboard link, a suggested query, and an owner. A “HighLatency” alert with no guidance on what to do only adds noise.
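
A simple lint step can enforce this before a rule reaches production. The sketch below checks a rule, represented as a Python dict in the shape of a Prometheus alerting rule, for the metadata listed above. The annotation and label names (runbook_url, dashboard_url, query_hint, owner, severity) and the example URLs are assumptions for illustration; use whatever naming your team has agreed on.

```python
# Hypothetical naming convention for required alert metadata.
REQUIRED_ANNOTATIONS = ("runbook_url", "dashboard_url", "query_hint")
REQUIRED_LABELS = ("owner", "severity")


def missing_alert_metadata(rule: dict) -> list[str]:
    """Return the required annotation/label paths that are absent or empty."""
    missing = []
    annotations = rule.get("annotations") or {}
    labels = rule.get("labels") or {}
    for name in REQUIRED_ANNOTATIONS:
        if not annotations.get(name):
            missing.append(f"annotations.{name}")
    for name in REQUIRED_LABELS:
        if not labels.get(name):
            missing.append(f"labels.{name}")
    return missing


rule = {
    "alert": "CheckoutHighLatency",
    "expr": (
        "histogram_quantile(0.95, sum(rate("
        'http_request_duration_seconds_bucket{service="checkout-api"}[5m]'
        ")) by (le)) > 0.8"
    ),
    "for": "10m",
    "labels": {"severity": "page", "owner": "team-checkout"},
    "annotations": {
        "runbook_url": "https://runbooks.example.com/checkout-latency",
        "dashboard_url": "https://grafana.example.com/d/checkout-red",
    },
}
print(missing_alert_metadata(rule))
```

Running a check like this in CI against the rule files keeps alerts without a runbook or owner from being merged.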

8. Sample troubleshooting workflow

When an alert fires for rising checkout latency:

Open the service's RED dashboard: is the request rate unusually high, and what is the error ratio?
Check the most recent release: was there a new deployment in the last 30 minutes?
Look for pod restarts or OOMKilled containers: kubectl -n prod get pods and kubectl -n prod describe pod ...
Open the slowest trace and find the span that dominates the time: database, Redis, external API, or an internal service.
Query logs by trace_id to see the detailed error.
If the release is involved, roll back with kubectl rollout undo deployment/checkout-api -n prod.
After the rollback, confirm that p95 latency and error ratio have returned to normal levels.

kubectl -n prod rollout history deployment/checkout-api
kubectl -n prod rollout undo deployment/checkout-api
kubectl -n prod rollout status deployment/checkout-api

9. Common mistakes

9.1. Storing lots of logs that cannot be queried

Logs without structure, without labels, and without a trace_id drive up cost while investigations stay slow.

9.2. Metric cardinality too high

Do not put user_id, email, or full URLs with query strings into Prometheus labels. Every distinct combination of label values becomes its own time series, so high cardinality inflates Prometheus memory usage and slows queries.
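
A rough back-of-the-envelope check makes the problem visible. The sketch below multiplies the number of distinct values per label to get a worst-case series count, and shows one way to turn a raw URL into a bounded route label. The label counts, the :id placeholder, and the function names are illustrative assumptions.

```python
import re
from math import prod


def estimated_series(label_value_counts: dict[str, int]) -> int:
    """Worst-case number of time series for one metric: product of label value counts."""
    return prod(label_value_counts.values())


# Path segments that look like numeric ids or UUIDs.
ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$",
    re.IGNORECASE,
)


def route_label(url: str) -> str:
    """Drop the query string and replace id-like path segments with a placeholder."""
    path = url.split("?", 1)[0]
    segments = [":id" if ID_SEGMENT.match(segment) else segment for segment in path.split("/")]
    return "/".join(segments)


bounded = {"service": 20, "route": 30, "status": 5}
with_user_id = {**bounded, "user_id": 100_000}
print("bounded labels:", estimated_series(bounded))
print("with user_id:  ", estimated_series(with_user_id))
print(route_label("/orders/12345/items?user=alice@example.com"))
```

Real series counts are usually below this upper bound because not every combination occurs, but the multiplication shows why a single unbounded label is enough to overwhelm the others.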

9.3. Dashboards that only show CPU/RAM

High CPU is not always an incident. Error rate and latency are what directly reflect the user experience.

9.4. No control over telemetry cost

On the cloud, logs and traces can get expensive quickly. Set retention per environment, sample traces, and filter out debug logs in production.
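
Two of these controls, trace sampling and debug-log filtering, can be sketched in plain Python. This illustrates the decision logic only and is not the algorithm any particular OpenTelemetry sampler uses; the hashing scheme, the environment names, and the function names are assumptions.

```python
import hashlib


def keep_trace(trace_id: str, sample_ratio: float) -> bool:
    """Deterministically keep roughly sample_ratio of traces, based on the trace id.

    Hashing the trace id means every service makes the same decision for the
    same trace, so sampled traces stay complete.
    """
    if sample_ratio >= 1.0:
        return True
    if sample_ratio <= 0.0:
        return False
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    position = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return position < sample_ratio


LEVEL_ORDER = {"debug": 10, "info": 20, "warning": 30, "error": 40}
# Hypothetical policy: ship everything outside production, info and above in production.
MIN_LEVEL_BY_ENV = {"dev": "debug", "staging": "debug", "prod": "info"}


def should_ship_log(level: str, environment: str) -> bool:
    """Decide whether a log line at this level leaves the cluster."""
    minimum = MIN_LEVEL_BY_ENV.get(environment, "info")
    return LEVEL_ORDER[level] >= LEVEL_ORDER[minimum]


trace_ids = [f"{i:032x}" for i in range(10_000)]
kept = sum(keep_trace(trace_id, 0.1) for trace_id in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces at a 10% ratio")
print("ship debug log in prod?", should_ship_log("debug", "prod"))
```

A fixed ratio is the simplest option. If you need to keep every failed or slow trace, the decision has to be made after the trace completes, which is a different mechanism from the one shown here.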

10. Production acceptance checklist

The cluster exposes node, pod, deployment, ingress, and application RED metrics.
Logs are structured, include trace_id/request_id, and contain no secrets or raw PII.
Tracing works across the critical services.
Dashboards offer views by service owner and by environment.
Alerts follow SLOs, link to a runbook, and have a clear severity.
Retention for logs/metrics/traces matches the budget.
Grafana/Prometheus are not exposed unprotected on the Internet.
A rollback procedure exists, along with post-rollback verification.
The on-call team has run at least one incident drill from alert to root cause.

11. End-of-article lab

Set up a kind/minikube cluster or a small managed cluster. Deploy a demo application with the endpoints /success, /error, and /slow. Then:

Install kube-prometheus-stack and open Grafana through a port-forward.
Add a RED dashboard for the application.
Send traffic with hey or wrk.
Trigger 500 errors and watch the error ratio.
Enable the OpenTelemetry SDK in the demo application and inspect a slow trace.
Write a p95 latency alert, together with a runbook for handling it.
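
If you do not have a demo application at hand, the three endpoints can be sketched with Python's standard library alone. The port, the 1.5-second delay, and the response bodies are arbitrary choices for this example. The sketch does not export metrics or traces; adding that instrumentation is part of the lab.

```python
import sys
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

SLOW_DELAY_SECONDS = 1.5  # arbitrary delay for the /slow endpoint


def route(path: str) -> tuple[int, str, float]:
    """Map a request path to (status code, response body, artificial delay)."""
    if path == "/success":
        return 200, "ok\n", 0.0
    if path == "/error":
        return 500, "simulated failure\n", 0.0
    if path == "/slow":
        return 200, "slow ok\n", SLOW_DELAY_SECONDS
    return 404, "not found\n", 0.0


class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body, delay = route(self.path.split("?", 1)[0])
        time.sleep(delay)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Serve the demo endpoints until interrupted."""
    ThreadingHTTPServer(("0.0.0.0", port), DemoHandler).serve_forever()


# Start the server only when invoked as: python demo_app.py serve
if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] == "serve":
    serve()
```

Saved under a name such as demo_app.py (a hypothetical file name), it can be started locally with python demo_app.py serve, or packaged into a container image and deployed to the lab cluster.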

This lab makes it clear that observability is not about “installing tools”. It is about designing signals so that incidents can be investigated quickly and recovery time goes down.

Conclusion

For Kubernetes on the cloud, good observability connects metrics, logs, and traces into one coherent operational story. When alerts point to user impact, dashboards show the trend, traces locate the slow span, and logs supply the context, the sysadmin/SRE team can resolve incidents faster, roll back with more confidence, and improve SLOs based on data.