Last updated June 2026

Case Study: GitOps on Amazon EKS — Moving Production Deployments from kubectl to ArgoCD and Helm

Outcomes

Deploying to production became a reviewed pull request merge instead of a manual kubectl session
Rollback is a git revert: about two minutes back to any previous release, with an audit trail
CI pipelines no longer hold cluster credentials at all
Config drift is gone — ArgoCD detects manual changes and self-heals them back to the git state
Full observability via kube-prometheus-stack, with Alertmanager paging a Slack channel
Roughly 30% node cost reduction after right-sizing requests/limits and adding autoscaling

How Deployments Worked Before

A subset of the client environments I look after run on Amazon EKS. When I took over one of these setups, deployments worked the way a lot of first Kubernetes pipelines do: a CI job built an image, then ran kubectl apply against a directory of raw manifests. The kubeconfig it used had cluster-admin. Every pipeline run was, in effect, an unsupervised admin session against production.

The day-to-day problems followed from that. Nobody could answer the question "what exactly is running in prod right now" with certainty, because the manifests in git only described what CI had applied last time. Hotfixes made by hand with kubectl edit never made it back into the repo and accumulated as silent drift between environments. Rolling back meant digging through old pipeline logs to figure out which manifest versions and image tags had been live before the bad deploy.

The constraints were ordinary ones. A small DevOps team supported many environments, so anything that needed per-cluster babysitting was out. Dev, staging, and prod had to behave identically while differing in size. The migration could not interrupt running workloads. The clients were cost-sensitive, and the tooling had to be open source.

Cluster as Code: Terraform, Managed Node Groups, and IRSA

The first step was getting the clusters themselves fully into Terraform. I reused the Terraform module library from my multi-environment AWS case study to define each cluster: VPC, EKS control plane, managed node groups, and the IAM OIDC provider that IAM Roles for Service Accounts (IRSA) depends on. With IRSA in place, pods assume narrowly scoped IAM roles instead of inheriting whatever the node's instance profile allows, and that became the standard way workloads get AWS API access.
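
On the Kubernetes side, IRSA comes down to an annotated ServiceAccount. The following is a minimal sketch, not the actual manifest from this project: the service account name, namespace, and role name are illustrative assumptions, and the account ID is left as a placeholder.

```yaml
# Hypothetical ServiceAccount for one workload; all names are placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: orders-api
  namespace: orders
  annotations:
    # EKS injects temporary credentials for this role into pods
    # that run under this service account.
    eks.amazonaws.com/role-arn: arn:aws:iam::<account-id>:role/<orders-api-role>
```

The IAM role's trust policy also has to allow this specific namespace and service account through the cluster's OIDC provider; that half lives in Terraform alongside the cluster definition.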

Because dev, staging, and prod come from the same modules with different variable values, the environments differ only where they are supposed to: instance types, node counts, and scaling bounds.

The ArgoCD App-of-Apps Pattern

ArgoCD runs in each cluster and watches a dedicated GitOps repository. The layout follows the app-of-apps pattern: one bootstrap Application per cluster points at a directory in the repo, and that directory contains an Application manifest for every service the cluster should run. Bringing a new service into an environment means adding one file and merging. Removing a service means deleting its file, and ArgoCD prunes the resources.

This also covers the platform itself. The ingress controller, cert-manager, and the monitoring stack are Applications in the same tree, so a brand-new cluster goes from empty to fully populated by applying a single bootstrap manifest.
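
As a rough illustration of the bootstrap manifest, the sketch below points one parent Application at the directory of child Application manifests. The `applications/stg` path is inferred from the file path used in the staging example in this article; the Application name and the `default` project are assumptions.

```yaml
# Hypothetical bootstrap Application for the staging cluster.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: bootstrap-stg        # assumed name
  namespace: argocd
spec:
  project: default           # assumed project
  source:
    repoURL: https://github.com/<org>/gitops-config.git
    targetRevision: main
    path: applications/stg   # directory holding one Application per service
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd        # child Applications are created in the argocd namespace
  syncPolicy:
    automated:
      prune: true            # deleting a child manifest removes that Application
      selfHeal: true
```

Applying this one manifest to a cluster that already runs ArgoCD is enough for ArgoCD to discover and create every child Application in the directory.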

Helm Charts with Per-Environment Values

Each service is packaged as a Helm chart with per-environment values files: values-dev.yaml, values-stg.yaml, values-prod.yaml. The chart holds everything the environments share. The values files hold what differs, which in practice is replica counts, resource requests, hostnames, and a handful of feature flags. That structure is what keeps the three environments behaving identically at different sizes, because all of them render from the same chart.
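
To make the split concrete, here is a sketch of what a per-environment values file might contain. Every key name and number below is an illustrative assumption; the real keys depend on how the chart's templates are written.

```yaml
# charts/orders-api/values-stg.yaml (hypothetical contents)
replicaCount: 2
image:
  tag: "<image-tag>"         # bumped by CI on each deploy
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    memory: 512Mi
ingress:
  host: orders.stg.<domain>
featureFlags:
  newCheckout: true          # example flag, not from the original project
```

A matching values-prod.yaml would carry the same keys with larger replica counts and requests, a production hostname, and more conservative flags.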

Sync policy differs by environment. Dev and staging Applications use automated sync with prune and selfHeal enabled: whatever is in git is what runs, and a manual change to the cluster gets reverted within minutes. Prod Applications sit behind a manual sync gate plus sync windows, so a merge to a prod values file stages a diff that someone approves in the ArgoCD UI during an allowed window. Sync waves handle ordering — database migration Jobs are annotated with an earlier wave than the Deployments that depend on the new schema, so the rollout never races its own migration.

# applications/stg/orders-api.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-api-stg
  namespace: argocd
spec:
  project: workloads
  source:
    repoURL: https://github.com/<org>/gitops-config.git
    targetRevision: main
    path: charts/orders-api
    helm:
      valueFiles:
        - values.yaml
        - values-stg.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: orders
  syncPolicy:
    automated:
      prune: true      # delete resources removed from git
      selfHeal: true   # revert manual cluster changes
    syncOptions:
      - CreateNamespace=true
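
For contrast with the staging Application above, the sketch below shows one way the prod gate and the migration ordering could be expressed. It is an assumption-laden illustration: the window schedule, the `*-prod` naming match, the Job name, image, and command are hypothetical, and the hook annotations on the Job are an addition so the Job can be recreated on each sync.

```yaml
# AppProject with a sync window that applies to prod Applications only.
# Other project fields (resource whitelists, roles) are omitted.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: workloads
  namespace: argocd
spec:
  sourceRepos:
    - https://github.com/<org>/gitops-config.git
  destinations:
    - server: https://kubernetes.default.svc
      namespace: "*"
  syncWindows:
    - kind: allow
      schedule: "0 9 * * 1-5"  # cron, UTC by default: 09:00 Monday to Friday
      duration: 8h
      applications:
        - "*-prod"             # syncs of matching apps are blocked outside the window
---
# Prod Application: same chart, prod values, no automated sync.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-api-prod
  namespace: argocd
spec:
  project: workloads
  source:
    repoURL: https://github.com/<org>/gitops-config.git
    targetRevision: main
    path: charts/orders-api
    helm:
      valueFiles:
        - values.yaml
        - values-prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: orders
  syncPolicy:
    # No `automated` block: ArgoCD marks the app OutOfSync and waits
    # for someone to trigger the sync.
    syncOptions:
      - CreateNamespace=true
---
# Migration Job rendered by the chart, ordered ahead of the Deployment.
apiVersion: batch/v1
kind: Job
metadata:
  name: orders-api-migrate     # hypothetical name
  annotations:
    argocd.argoproj.io/sync-wave: "-1"   # runs before wave 0 (the default)
    argocd.argoproj.io/hook: Sync        # assumption: run as a hook
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: <registry>/orders-api:<image-tag>
          command: ["/app/migrate", "up"]   # hypothetical migration command
```

Resources without a sync-wave annotation default to wave 0, so the Deployment is only applied once the wave -1 Job has completed successfully.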

Pull-Based Delivery: CI Builds, Git Carries the Deploy

With ArgoCD pulling from git, the CI pipeline's job shrinks. GitHub Actions builds the image, pushes it to ECR, and commits a one-line image tag bump to the GitOps repo. ArgoCD notices the commit and syncs. I evaluated argocd-image-updater, which can watch ECR and write tag updates automatically, but settled on explicit bump commits: every deploy is a commit with an author and a diff, and the git log of the GitOps repo doubles as the deploy history.
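
A workflow along these lines would implement the build, push, and bump steps. Treat it as a sketch under several assumptions: the IAM role, region, secret name (`GITOPS_REPO_TOKEN`), image repository name, and the `.image.tag` values key are all hypothetical, and it assumes the mikefarah `yq` v4 CLI is available on the runner.

```yaml
# .github/workflows/build-and-bump.yaml (hypothetical)
name: build-and-bump
on:
  push:
    branches: [main]
permissions:
  id-token: write   # lets the job assume an AWS role via OIDC
  contents: read
jobs:
  build:
    runs-on: ubuntu-latest
    env:
      IMAGE_TAG: ${{ github.sha }}
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::<account-id>:role/<ecr-push-role>
          aws-region: <region>
      - id: ecr
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push image
        env:
          REGISTRY: ${{ steps.ecr.outputs.registry }}
        run: |
          docker build -t "$REGISTRY/orders-api:$IMAGE_TAG" .
          docker push "$REGISTRY/orders-api:$IMAGE_TAG"
      - name: Check out the GitOps repo
        uses: actions/checkout@v4
        with:
          repository: <org>/gitops-config
          token: ${{ secrets.GITOPS_REPO_TOKEN }}   # write access to this one repo only
          path: gitops
      - name: Commit the image tag bump
        working-directory: gitops
        run: |
          yq -i '.image.tag = strenv(IMAGE_TAG)' charts/orders-api/values-stg.yaml
          git config user.name "$GITHUB_ACTOR"
          git config user.email "$GITHUB_ACTOR@users.noreply.github.com"
          git commit -am "orders-api stg: bump image tag to ${IMAGE_TAG}"
          git push
```

Nothing in the workflow talks to the Kubernetes API. Its last action is a git push, and ArgoCD takes over from there.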

The consequence I cared about most is that CI lost its cluster access entirely. The cluster-admin kubeconfigs were deleted from CI secrets, and the only permissions the pipeline holds now are pushing images to ECR and committing to one repository. ArgoCD runs inside the cluster and pulls from git, so no external system needs credentials for the Kubernetes API.

Right-Sizing and Autoscaling

The original manifests had no resource requests or limits on most containers, so the scheduler had nothing to plan with and the node groups were sized by guesswork. I pulled real CPU and memory usage from container metrics, set requests near the observed steady state with limits that leave headroom, then added HorizontalPodAutoscalers on the services with variable load and cluster-autoscaler on the node groups.
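
A HorizontalPodAutoscaler for one of those services could look like the sketch below. The replica bounds and the 70% target are illustrative assumptions, not the values used in this project.

```yaml
# Hypothetical HPA for a service with variable load.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-api
  namespace: orders
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the container's CPU request
```

CPU utilization here is measured against the pod's CPU request, which is one more reason the requests had to reflect real usage before autoscaling could behave sensibly.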

Once requests reflected reality, bin-packing improved enough to drop the node count, and the autoscaler trimmed capacity outside business hours. Node spend came down roughly 30%. Starting the same project today I would evaluate Karpenter for node provisioning, since it picks instance shapes per workload instead of scaling fixed-shape groups.

Observability with kube-prometheus-stack

Monitoring went in through the same pipeline as everything else: kube-prometheus-stack installed from its Helm chart, managed by its own ArgoCD Application, so Prometheus, Grafana, and Alertmanager are versioned alongside the workloads they watch. Grafana carries golden-signal dashboards per service (latency, traffic, errors, saturation), and Alertmanager pages a Slack channel. Loki collects logs, which means on-call can read application logs without needing kubectl access at all.
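
An Application that installs the chart straight from its Helm repository might look like this sketch. The project name and namespace are assumptions, the chart version is left as a placeholder, and the `ServerSideApply` option is an addition commonly used with this chart because its CRDs are too large for client-side apply annotations.

```yaml
# Hypothetical Application for the monitoring stack.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kube-prometheus-stack
  namespace: argocd
spec:
  project: platform              # assumed project name
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: <chart-version>   # pin an explicit chart version
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring        # assumed namespace
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true     # assumption: avoids oversized-annotation errors on CRDs
```

Pinning `targetRevision` keeps chart upgrades as deliberate commits in the GitOps repo, the same as any other deploy.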

The alert rules that have mattered most are basic ones: containers in CrashLoopBackOff, pods stuck Pending, node memory and disk pressure, and certificates approaching expiry.
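
Rules of this kind can be expressed as a PrometheusRule resource. The sketch below is illustrative only: thresholds and durations are assumptions, the `release` label must match whatever rule selector the Prometheus instance uses, and the certificate rule assumes cert-manager's metrics are being scraped. The chart's bundled rules already cover similar conditions, so this mainly shows the shape of a custom rule.

```yaml
# Hypothetical custom alert rules.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workload-basics
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumed to match the Prometheus ruleSelector
spec:
  groups:
    - name: workload-basics
      rules:
        - alert: ContainerCrashLooping
          expr: 'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1'
          for: 10m
          labels:
            severity: critical
        - alert: PodStuckPending
          expr: 'kube_pod_status_phase{phase="Pending"} == 1'
          for: 15m
          labels:
            severity: warning
        - alert: NodeMemoryPressure
          expr: 'kube_node_status_condition{condition="MemoryPressure",status="true"} == 1'
          for: 5m
          labels:
            severity: critical
        - alert: CertificateExpiringSoon
          # fires when a cert-manager certificate expires in under 14 days
          expr: 'certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600'
          for: 1h
          labels:
            severity: warning
```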

GitOps vs kubectl apply: What Actually Changed

Deploy lead time went from a supervised 30 to 45 minutes, where an engineer watched the pipeline and then poked at the cluster to confirm, to under 10 minutes hands-off. The GitOps repo answers "what is running in prod" by inspection, and ArgoCD shows live state against desired state for every service.

Rollback is a git revert on the GitOps repo. ArgoCD syncs the previous state in about two minutes, and the revert commit records who rolled back what, and when. Since selfHeal went live there have been zero drift incidents, because ArgoCD reverts manual changes back to the git state. On-call diagnosis also got quicker in a way I hadn't fully anticipated. Grafana annotations mark each deploy on the metric graphs, so it is straightforward to check whether a problem started with a release.

What I'd Do Differently

Use ApplicationSets from the start. I copy-pasted Application manifests per service per environment, and by the time I converted to an ApplicationSet generator there were dozens of nearly identical files to clean up.
Set resource requests and limits when the first workloads go in. The ~30% cost reduction came from fixing a gap that had been there since the original manifests were written.
Manage secrets through GitOps from the beginning rather than retrofitting it. The realistic options are sealed-secrets or SOPS; the SOPS with KMS pattern from my DevSecOps pipeline case study is what I would reach for now.
Evaluate Karpenter for the spiky workloads. Cluster-autoscaler with fixed node group shapes leaves capacity idle when load varies a lot.
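
For the first point, an ApplicationSet with a list generator is one way to replace the copy-pasted files. This is a sketch, not the generator actually used in the conversion; the second service entry is hypothetical, and the template mirrors the staging Application shown earlier.

```yaml
# Hypothetical ApplicationSet producing one staging Application per service.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: workloads-stg
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - service: orders-api
            namespace: orders
          - service: billing-api     # hypothetical second service
            namespace: billing
  template:
    metadata:
      name: '{{service}}-stg'
    spec:
      project: workloads
      source:
        repoURL: https://github.com/<org>/gitops-config.git
        targetRevision: main
        path: 'charts/{{service}}'
        helm:
          valueFiles:
            - values.yaml
            - values-stg.yaml
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{namespace}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
```

Adding a service then becomes a two-line change to the `elements` list rather than a new file per environment.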

My Kubernetes / EKS operations service starts from this exact playbook.

I move teams from push-based kubectl pipelines to ArgoCD-managed GitOps on EKS, including the Terraform, Helm structure, and monitoring around it.
