Instead of asking Claude a single question, I wanted to see what would happen if I gave it a troubleshooting workflow similar to how an experienced platform engineering team operates.
When a production incident happens, engineers rarely investigate everything themselves.
- One person looks at Kubernetes events.
- Another checks pod health and logs.
- Someone investigates networking.
- A platform engineer reviews autoscaling and node provisioning.
The findings are then combined to identify the actual root cause.
Claude Code workflows make it possible to replicate this approach using sub-agents.
The Five Specialized Troubleshooting Agents
For this experiment, I created five specialized agents:
Events Agent
Responsible for collecting and analyzing Kubernetes events.
Focus areas:
- Failed scheduling
- BackOff events
- Warning events
- Cluster-wide failures
Pod Health Agent
Responsible for pod status and application health.
Focus areas:
- Container restarts
- CreateContainerConfigError
- CrashLoopBackOff
- Application logs
Networking Agent
Responsible for service discovery and network troubleshooting.
Focus areas:
- Services
- Endpoints
- Ingress resources
- Network policies
Karpenter Agent
Responsible for autoscaling and provisioning.
Focus areas:
- NodePools
- NodeClaims
- Provisioning failures
- Capacity constraints
Infrastructure Agent
Responsible for cluster capacity and resource analysis.
Focus areas:
- Node health
- Resource utilization
- Allocatable resources
- Infrastructure bottlenecks
Each agent had a specific responsibility and was instructed to collect evidence rather than jump to conclusions.
A coordinator agent reviewed all findings before producing a diagnosis.
Testing the Workflow Against Real EKS Failures
To evaluate the workflow, I intentionally introduced multiple failures into a test namespace.
The namespace contained three independent problems:
Failure 1: Invalid NodePool Requirements
A workload requested a node label that did not exist in any configured NodePool.
Result:
- Pods remained Pending
- Scheduler generated FailedScheduling events
- Karpenter could not satisfy requirements
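As a rough illustration of why this failure leaves pods Pending, the sketch below compares a pod's nodeSelector with the labels each NodePool can produce. It is a simplified model, not Karpenter's actual scheduling logic: it only considers static template labels and `In` requirements, and it ignores well-known labels that Karpenter can infer on its own. The function names, the `workload-tier` label, and the sample NodePool are hypothetical.

```python
from typing import Dict, List


def nodepool_label_values(nodepool: dict) -> Dict[str, set]:
    """Collect the label values a NodePool can place on nodes (simplified).

    Reads spec.template.metadata.labels and the `In` entries of
    spec.template.spec.requirements. Other operators are ignored here.
    """
    template = nodepool.get("spec", {}).get("template", {})
    values: Dict[str, set] = {}
    for key, value in template.get("metadata", {}).get("labels", {}).items():
        values.setdefault(key, set()).add(value)
    for req in template.get("spec", {}).get("requirements", []):
        if req.get("operator") == "In":
            values.setdefault(req["key"], set()).update(req.get("values", []))
    return values


def unsatisfied_selectors(node_selector: Dict[str, str],
                          nodepools: List[dict]) -> Dict[str, List[str]]:
    """Return, per NodePool name, the selector keys that pool cannot satisfy."""
    result = {}
    for pool in nodepools:
        available = nodepool_label_values(pool)
        result[pool["metadata"]["name"]] = [
            key for key, wanted in node_selector.items()
            if wanted not in available.get(key, set())
        ]
    return result


# Hypothetical NodePool, shaped like `kubectl get nodepool -o json` items.
SAMPLE_POOLS = [{
    "metadata": {"name": "default"},
    "spec": {"template": {
        "metadata": {"labels": {"team": "platform"}},
        "spec": {"requirements": [
            {"key": "karpenter.sh/capacity-type", "operator": "In",
             "values": ["on-demand"]},
        ]},
    }},
}]

print(unsatisfied_selectors({"workload-tier": "gpu"}, SAMPLE_POOLS))
# {'default': ['workload-tier']}
```

An empty list for a pool means that, under this simplified model, the pool could satisfy the selector. A non-empty list for every pool matches the symptom described above.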
Failure 2: Missing Kubernetes Secret
A deployment referenced a Secret that did not exist.
Result:
- Pods entered CreateContainerConfigError
- Containers never started
- Image pulls completed successfully
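This signature can be picked out of pod status mechanically. The sketch below assumes the input is the output of `kubectl get pods -n <namespace> -o json` and looks for containers stuck in a waiting state with a given reason. The pod name, container name, Secret name, and message text in the sample are made up for illustration.

```python
import json
from typing import List, Tuple


def waiting_containers(pod_list_json: str, reason: str) -> List[Tuple[str, str, str]]:
    """Return (pod, container, message) for containers waiting with `reason`."""
    hits = []
    for pod in json.loads(pod_list_json).get("items", []):
        for status in pod.get("status", {}).get("containerStatuses", []):
            # `state` holds exactly one of: waiting, running, terminated.
            waiting = status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == reason:
                hits.append((pod["metadata"]["name"], status["name"],
                             waiting.get("message", "")))
    return hits


# Illustrative sample only; real output contains many more fields.
SAMPLE_PODS = json.dumps({"items": [{
    "metadata": {"name": "api-7d9f"},
    "status": {"containerStatuses": [{
        "name": "api",
        "state": {"waiting": {
            "reason": "CreateContainerConfigError",
            "message": 'secret "db-credentials" not found',
        }},
    }]},
}]})

for pod, container, message in waiting_containers(SAMPLE_PODS, "CreateContainerConfigError"):
    print(f"{pod}/{container}: {message}")
```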
Failure 3: Karpenter Provisioning Blocked
NodePool CPU limits were misconfigured in a way that left Karpenter unable to provision new capacity.
Result:
- NodePools appeared healthy
- No NodeClaims were created
- No new EC2 instances launched
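A quick way to spot this kind of failure is to compare a NodePool's CPU limit with what it has already provisioned. The sketch below assumes NodePool objects expose `spec.limits.cpu` and `status.resources.cpu`, as in the Karpenter v1 API, and it only understands plain and millicore (`m`) CPU quantities. The function names and sample values are hypothetical, not taken from the test cluster.

```python
from typing import Optional


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "500m", "4" or "1.5" to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def cpu_headroom_millicores(nodepool: dict) -> Optional[int]:
    """Return CPU left under spec.limits.cpu, or None if no CPU limit is set."""
    limit = nodepool.get("spec", {}).get("limits", {}).get("cpu")
    if limit is None:
        return None
    used = nodepool.get("status", {}).get("resources", {}).get("cpu", "0")
    return cpu_to_millicores(str(limit)) - cpu_to_millicores(str(used))


# Hypothetical NodePool whose limit equals what is already provisioned.
SAMPLE_NODEPOOL = {
    "metadata": {"name": "default"},
    "spec": {"limits": {"cpu": "8"}},
    "status": {"resources": {"cpu": "8"}},
}

headroom = cpu_headroom_millicores(SAMPLE_NODEPOOL)
print(f"{SAMPLE_NODEPOOL['metadata']['name']}: {headroom}m CPU headroom")
```

A headroom of zero or less is consistent with the symptoms above: the NodePool object itself looks fine, yet no NodeClaims appear.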
From a user's perspective, everything simply looked broken.
The interesting question was whether Claude could identify all three issues independently.
What The Agents Discovered
The Events Agent immediately identified multiple failure signatures rather than a single cluster-wide issue.
The Pod Health Agent recognized that some workloads were failing before container startup while others were unable to schedule at all.
The Karpenter Agent discovered that NodePool CPU limits had effectively disabled provisioning.
The Infrastructure Agent verified that existing nodes still had available capacity and ruled out infrastructure exhaustion.
The Networking Agent investigated service discovery and network resources and concluded that networking was not responsible for any of the observed failures.
This was important because networking is often blamed first during Kubernetes incidents. In this case, networking was a red herring.
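To make "multiple failure signatures" concrete, here is a minimal sketch that groups Warning events by reason. It assumes the input is the output of `kubectl get events -n <namespace> -o json`. The object names and messages in the sample are invented, and this is an illustration of the idea rather than what the Events Agent actually ran.

```python
import json
from collections import defaultdict
from typing import Dict, List


def group_warnings_by_reason(event_list_json: str) -> Dict[str, List[str]]:
    """Map each Warning reason to the sorted names of the objects it affects."""
    grouped = defaultdict(set)
    for event in json.loads(event_list_json).get("items", []):
        if event.get("type") != "Warning":
            continue  # skip Normal events such as Pulled or Scheduled
        name = event.get("involvedObject", {}).get("name", "?")
        grouped[event.get("reason", "Unknown")].add(name)
    return {reason: sorted(names) for reason, names in grouped.items()}


# Illustrative sample only.
SAMPLE_EVENTS = json.dumps({"items": [
    {"type": "Warning", "reason": "FailedScheduling",
     "involvedObject": {"kind": "Pod", "name": "batch-1"},
     "message": "0/3 nodes are available"},
    {"type": "Warning", "reason": "Failed",
     "involvedObject": {"kind": "Pod", "name": "api-1"},
     "message": 'Error: secret "db-credentials" not found'},
    {"type": "Normal", "reason": "Pulled",
     "involvedObject": {"kind": "Pod", "name": "api-1"},
     "message": "Successfully pulled image"},
]})

for reason, names in group_warnings_by_reason(SAMPLE_EVENTS).items():
    print(f"{reason}: {', '.join(names)}")
```

Distinct reasons attached to distinct pods are a hint that more than one problem is in play.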
The Final Diagnosis
After combining the findings from all five agents, Claude Code identified three independent root causes:
- A missing Kubernetes Secret causing CreateContainerConfigError
- Invalid NodePool requirements preventing scheduling
- Karpenter CPU limits preventing node provisioning
It also identified several secondary findings:
- Networking was not contributing to the failures
- Existing nodes still had available resources
- Metrics Server was unhealthy
- NodePool configuration had drifted from its original state
Most importantly, Claude did not assume that every symptom shared the same root cause.
That was the biggest difference compared to a traditional single-prompt investigation.
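One way to picture "not assuming a shared root cause" is to cluster findings by the workloads they touch and treat clusters with no overlap as separate candidates. The sketch below is only an illustration of that idea, not how Claude Code reasons internally. The `Finding` type, the workload names, and the summaries are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Finding:
    agent: str
    summary: str
    workloads: FrozenSet[str]


def independent_groups(findings: List[Finding]) -> List[List[Finding]]:
    """Cluster findings that share a workload; unrelated clusters stay separate."""
    groups: List[List[Finding]] = []
    for finding in findings:
        overlapping = [g for g in groups
                       if any(finding.workloads & f.workloads for f in g)]
        # Drop the overlapping groups and re-add them merged with this finding.
        groups = [g for g in groups if not any(g is o for o in overlapping)]
        groups.append([f for g in overlapping for f in g] + [finding])
    return groups


SAMPLE_FINDINGS = [
    Finding("events", "FailedScheduling", frozenset({"batch"})),
    Finding("karpenter", "nodeSelector not satisfiable", frozenset({"batch"})),
    Finding("pod-health", "CreateContainerConfigError", frozenset({"api"})),
    Finding("karpenter", "CPU limit reached", frozenset({"worker"})),
]

for i, group in enumerate(independent_groups(SAMPLE_FINDINGS), start=1):
    print(f"candidate {i}: " + "; ".join(f"{f.agent}: {f.summary}" for f in group))
```

With the sample data this yields three candidate groups, which mirrors the three independent root causes in the diagnosis.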
Why This Approach Worked
The value wasn’t that Claude magically fixed Kubernetes.
The value was that five specialized agents investigated different layers of the platform simultaneously.
Instead of manually switching between:
- Events
- Logs
- Services
- Endpoints
- NodePools
- Infrastructure
the workflow gathered evidence across the entire stack in parallel.
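As a plain-Python analogy for the parallel fan-out, the sketch below runs one collector per layer concurrently and gathers the results by name. It is not how Claude Code schedules sub-agents. The collectors here are stubs, and in practice each one might wrap a read-only `kubectl` call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def collect_in_parallel(collectors: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Run every collector concurrently and return its output keyed by layer."""
    with ThreadPoolExecutor(max_workers=len(collectors) or 1) as pool:
        futures = {name: pool.submit(fn) for name, fn in collectors.items()}
        # result() blocks until that collector finishes and re-raises its errors.
        return {name: future.result() for name, future in futures.items()}


# Stub collectors standing in for real evidence gathering.
SAMPLE_COLLECTORS = {
    "events": lambda: "2 warning reasons",
    "pod-health": lambda: "1 pod in CreateContainerConfigError",
    "networking": lambda: "no issues found",
    "karpenter": lambda: "0 NodeClaims",
    "infrastructure": lambda: "nodes Ready",
}

for layer, evidence in collect_in_parallel(SAMPLE_COLLECTORS).items():
    print(f"{layer}: {evidence}")
```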
For Amazon EKS environments, that can significantly reduce the time required to identify the actual source of a problem. This was particularly noticeable for Karpenter-related failures.
If you’ve read my previous article, "Karpenter Not Launching Nodes in EKS: Real Debugging Scenarios", you’ll know that provisioning issues often look like scheduling issues.
Similarly, networking symptoms can be misleading. In "Why Your Kubernetes Service Has No Endpoints (And How to Fix It)" and "Why Your Kubernetes Pods Are Running But Not Reachable And How to Fix It", the visible symptom appeared to be networking, while the actual root cause was elsewhere. The multi-agent approach helped separate symptoms from causes much faster.
The CLAUDE.md File
One of the biggest lessons from this experiment was that the results weren’t coming from a magical prompt.
The quality of the investigation came from the workflow.
Instead of asking Claude Code a generic question, I gave it a repeatable troubleshooting process through a CLAUDE.md file.
This file defines:
- What agents should be created
- Which commands are allowed
- How evidence should be collected
- How findings should be reported
- What safety guardrails must be followed
For production environments, these guardrails are critical. I don’t want an AI assistant making changes to a production EKS cluster. I want it gathering evidence, identifying likely root causes, and recommending verification steps. The rules I used were intentionally restrictive.
Rules
READ-ONLY access only
NEVER run:
- kubectl delete
- kubectl apply
- kubectl patch
- kubectl edit
- kubectl scale
- helm install
- helm upgrade
ONLY run:
- kubectl get
- kubectl describe
- kubectl logs
- kubectl top
- aws ec2 describe-*
- aws service-quotas
Do not fix anything.
Only diagnose and recommend.
With those guardrails in place, I was comfortable letting Claude Code investigate the cluster because every action was observational.
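Rules in a CLAUDE.md file are instructions to the model, so it can help to picture how the same allowlist might be checked mechanically. The sketch below is a deliberately conservative, hypothetical validator, not part of Claude Code. It rejects anything containing shell operators, requires the verb to come right after `kubectl`, and assumes `aws service-quotas` should be limited to `get-` and `list-` subcommands, because that service also has write operations such as `request-service-quota-increase`.

```python
import shlex

ALLOWED_KUBECTL_VERBS = {"get", "describe", "logs", "top"}
# Characters that could chain, redirect, or substitute commands.
FORBIDDEN_CHARS = set(";|&<>$`\n")


def is_allowed(command: str) -> bool:
    """Return True only for commands matching the read-only allowlist."""
    if any(ch in FORBIDDEN_CHARS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes: refuse rather than guess
    if len(tokens) < 2:
        return False
    if tokens[0] == "kubectl":
        return tokens[1] in ALLOWED_KUBECTL_VERBS
    if tokens[:2] == ["aws", "ec2"] and len(tokens) > 2:
        return tokens[2].startswith("describe-")
    if tokens[:2] == ["aws", "service-quotas"] and len(tokens) > 2:
        # Assumption: only read-style subcommands are acceptable.
        return tokens[2].startswith(("get-", "list-"))
    return False


for cmd in [
    "kubectl get pods -n demo -o wide",
    "kubectl delete pod api-1 -n demo",
    "kubectl get pods; kubectl delete pod api-1",
    "aws service-quotas request-service-quota-increase --service-code ec2",
]:
    print(f"{'ALLOW' if is_allowed(cmd) else 'DENY '}  {cmd}")
```

The check errs on the side of denying: a valid read-only command written with global flags before the verb would also be rejected, which seems like the right trade-off for a production cluster.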
The workflow looked like this:
Coordinator Agent
│
├── Events Agent
├── Pod Health Agent
├── Networking Agent
├── Karpenter Agent
└── Infrastructure Agent
And here’s the complete CLAUDE.md file I used.
# EKS Troubleshooting Agent
You are a Kubernetes troubleshooting coordinator for an Amazon EKS cluster.
## Rules
- READ-ONLY access only
- NEVER run: kubectl delete, kubectl apply, kubectl patch, kubectl edit, kubectl scale, helm install/upgrade
- ONLY run: kubectl get, kubectl describe, kubectl logs, kubectl top, aws ec2 describe-*, aws service-quotas
- Do NOT fix anything — only diagnose and recommend
## Workflow
When investigating an issue, spawn sub-agents in parallel to investigate different layers simultaneously.
### Sub-Agent 1: Events Agent
Collects and analyzes cluster events for the target namespace:
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
kubectl get events -n <namespace> --field-selector reason=Failed
kubectl get events -n <namespace> --field-selector reason=BackOff
kubectl get events -n <namespace> --field-selector reason=FailedScheduling
### Sub-Agent 2: Pod Health Agent
Checks pod status, container state, and logs:
kubectl get pods -n <namespace> -o wide
kubectl describe pod -n <namespace> <pod-name>
kubectl logs -n <namespace> <pod-name> --previous --tail=50
kubectl logs -n <namespace> <pod-name> --tail=50
### Sub-Agent 3: Networking Agent
Checks services, endpoints, and network policies:
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace>
kubectl get networkpolicy -n <namespace>
kubectl get ingress -n <namespace>
### Sub-Agent 4: Karpenter Agent
Checks autoscaling and node provisioning:
kubectl get nodepool -o yaml
kubectl get nodeclaim
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=100
kubectl get nodepool -o jsonpath='{.items[*].spec.limits}'
### Sub-Agent 5: Infrastructure Agent
Checks node capacity and resource usage:
kubectl get nodes -o wide
kubectl top nodes
kubectl top pods -n <namespace>
kubectl describe nodes
## Output Format
### Evidence Summary
Summarize findings from each agent.
### Root Cause Analysis
Identify the actual root cause and explain why other symptoms are secondary.
### Recommended Fix
Describe the fix but do not execute it.
### Verification Steps
Provide commands that can be used to verify the issue is resolved.
Feel free to adapt this workflow for your own EKS clusters.
For example, you could add specialized agents for:
- AWS Load Balancer Controller
- Istio
- ExternalDNS
- Argo CD
- Amazon Bedrock workloads
- FastMCP services
The important part isn’t the exact commands; it’s the workflow around them.
The goal is not to automate production changes. The goal is to automate evidence collection and root cause analysis.