AWS Fargate: Resources, Sidecars, and Pod Evictions
Fargate is AWS’s serverless compute engine for containers. Instead of managing EC2 nodes, each pod gets its own isolated micro-VM. This removes node management overhead but introduces some differences from standard EC2-backed EKS nodes worth knowing about. If you’re using Karpenter on EC2 node groups, you can optimise compute in a much more flexible way, with instance selection that adapts to your pod requirements dynamically.
This post explains some of the pros and cons of Fargate, how memory reservation works, the evolution of the Datadog sidecar pattern, and how to handle AWS’s periodic Fargate pod evictions proactively. Choosing between Fargate and an EC2 fleet managed by Karpenter is a trade-off, and both sides are covered here.
Memory: What You Actually Get
Fargate reserves a portion of memory for its own overhead, and any sidecars you run (e.g. Datadog agent) consume additional memory on top of that. Your application container gets what’s left.
With a Datadog sidecar (192Mi):
Fargate 1GB = 1024Mi - 256Mi(fargate_reserved) - 192Mi(datadog) = 576Mi available
Fargate 2GB = 2048Mi - 256Mi(fargate_reserved) - 192Mi(datadog) = 1600Mi available
Fargate 3GB = 3072Mi - 256Mi(fargate_reserved) - 192Mi(datadog) = 2624Mi available
Fargate 4GB = 4096Mi - 256Mi(fargate_reserved) - 192Mi(datadog) = 3648Mi available
Fargate 6GB = 6144Mi - 256Mi(fargate_reserved) - 192Mi(datadog) = 5696Mi available
Fargate 8GB = 8192Mi - 256Mi(fargate_reserved) - 192Mi(datadog) = 7744Mi available
Always account for sidecar memory when sizing your Fargate pods; the numbers in the AWS console are the total for the pod, not what your app container gets.
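The same arithmetic can be written as a small shell helper for sanity-checking a tier before picking it. This is only a sketch: app_memory_mi is a made-up name, and the 256Mi reservation and 192Mi sidecar default are simply the figures from the table above, not values read from AWS.

```sh
# app_memory_mi TIER_GB [SIDECAR_MI]
# Prints the memory (in Mi) left for the application container.
app_memory_mi() {
  tier_gb=$1
  sidecar_mi=${2:-192}      # assumed Datadog sidecar size, as in the table above
  fargate_reserved_mi=256   # Fargate's own reservation
  echo $(( tier_gb * 1024 - fargate_reserved_mi - sidecar_mi ))
}

app_memory_mi 3       # 3GB tier with the default sidecar: prints 2624
app_memory_mi 4 0     # 4GB tier without a sidecar: prints 3840
```

Pass a different second argument if your sidecars add up to something other than 192Mi.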
CPU and Memory Requests
On Fargate, resource requests define what AWS provisions, so treat them as the size of the VM you are requesting. You must set requests explicitly; otherwise you get the Fargate default (0.25 vCPU, 512Mi), which is too small for most workloads.
Do not set resource limits on Fargate. Removing limits allows the container to burst up to the full capacity of the provisioned Fargate instance:
resources:
  requests:
    cpu: 950m # 1 vCPU minus a small buffer
    memory: 2624Mi # 3GB Fargate tier minus overhead
  # No limits: allows bursting to full Fargate instance capacity
Fargate Rounds Up to the Next Valid Configuration
Fargate only supports specific vCPU and memory combinations. If your request doesn’t match a valid config exactly, AWS rounds up to the next one, and you pay for the larger size.
For example, requesting 1 vCPU and 8GB memory will provision a 2 vCPU / 9GB instance: the 256Mi Fargate reservation pushes the total just over 8GB, and 1 vCPU configurations stop at 8GB, so no 1 vCPU / 9GB configuration exists. Make sure your requests align with valid Fargate configurations to avoid unexpected cost.
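To see where a given request will land, the rounding can be approximated with a lookup like the one below. Treat it as a rough sketch rather than AWS's actual algorithm: fargate_config is a hypothetical name, the table of valid combinations is an assumption transcribed from the AWS documentation and may be out of date, and the inputs are the summed requests of all containers in the pod.

```sh
# fargate_config CPU_MILLICORES MEMORY_MI
# Prints "<vCPU in millicores> <memory in Mi>" for the smallest configuration that fits.
fargate_config() {
  cpu_req=$1
  mem_req=$(( $2 + 256 ))   # add the 256Mi Fargate reservation to the pod's requests
  # Each row: vCPU (millicores), min memory, max memory, memory step (all in Mi)
  while read -r cpu min max step; do
    [ "$cpu_req" -le "$cpu" ] || continue
    mem=$(( (mem_req + step - 1) / step * step ))   # round memory up to the next step
    [ "$mem" -ge "$min" ] || mem=$min
    if [ "$mem" -le "$max" ]; then
      echo "$cpu $mem"
      return 0
    fi
  done <<EOF
250 512 1024 512
250 2048 2048 1024
500 1024 4096 1024
1000 2048 8192 1024
2000 4096 16384 1024
4000 8192 30720 1024
8000 16384 61440 4096
16000 32768 122880 8192
EOF
  echo "no valid Fargate configuration" >&2
  return 1
}

fargate_config 1000 8192   # prints: 2000 9216 (2 vCPU / 9GB, as in the example above)
fargate_config 950 2624    # prints: 1000 3072 (1 vCPU / 3GB)
```

Running your requests through something like this before deploying makes an accidental jump to the next vCPU tier easier to spot.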
Datadog agent sidecar
The Old Way: Manual Sidecar in Helm
Previously, running a Datadog agent on Fargate required explicitly defining the sidecar container in your Helm chart; the agent image, environment variables, security context, and log path all had to be configured per service:
datadog:
  enabled: true
  secretName: datadog-cluster-agent
  logPath: /logs/service.log
  sidecar:
    image:
      repository: datadog/agent
      tag: 7-jmx
    securityContext:
      runAsNonRoot: true
      runAsUser: 100
      runAsGroup: 100
      readOnlyRootFilesystem: false
      allowPrivilegeEscalation: false
    env:
      DD_APM_ENABLED: true
      DD_SITE: datadoghq.eu
      DD_EKS_FARGATE: true
      DD_LOGS_ENABLED: true
      DD_CONTAINER_EXCLUDE: name:datadog-agent
      DD_CLUSTER_AGENT_ENABLED: true
      DD_CLUSTER_AGENT_URL: https://datadog-cluster-agent.monitoring.svc:5005
      DD_ORCHESTRATOR_EXPLORER_ENABLED: true
This works but means every service is responsible for managing its own sidecar config, keeping the agent version up to date, and ensuring the configuration is correct.
The New Way: Mutating Webhook Injection
Datadog now supports automatic sidecar injection on EKS Fargate via a mutating admission webhook on the cluster agent. When a pod has the label agent.datadoghq.com/sidecar: "fargate", the cluster agent automatically injects the Datadog agent sidecar at pod startup; no manual sidecar definition needed in your Helm chart.
podLabels:
agent.datadoghq.com/sidecar: "fargate"
That’s it. The cluster agent handles the rest: image version, environment variables, APM, and log collection.
Rather than reading log files from a shared volume (as the manual sidecar approach does), the injected agent tails logs directly from the kubelet log endpoint. This means no log file path configuration, no shared volume mounts between containers, and logs are collected even if the application doesn’t write to a file.
Reference: https://www.datadoghq.com/blog/eks-fargate-logs-datadog/
Benefits over the manual approach:
- Agent version is managed centrally by the platform team, not per service
- No sidecar boilerplate in every Helm chart
- Consistent configuration across all services
- Easier to roll out agent upgrades cluster-wide
Fargate vs EC2 Nodes: Key Differences
| | Fargate | EC2 nodes |
|---|---|---|
| Node management | None | You manage node groups |
| Startup time | Slower (new micro-VM per pod) | Faster (pod schedules onto existing node) |
| DaemonSets | Not supported | Supported |
| Sidecar injection | Required for node-level agents | DaemonSet handles it |
| Cost model | Per pod (vCPU + memory) | Per node (instance type) |
| Isolation | Strong (dedicated VM per pod) | Shared node |
The lack of DaemonSet support is the main operational difference; anything that would normally run as a DaemonSet (log collectors, agents, proxies) must instead run as a sidecar in each pod. The Datadog mutating webhook injection approach above is the cleanest solution to this.
This also means reduced observability at the infrastructure layer. On EC2 nodes you can run node-level agents that give you host metrics, network stats, disk I/O, and process-level visibility. On Fargate, none of that is available; you can’t run a DaemonSet, you have no access to the underlying host, and visibility is limited to what your pod-level sidecars can report. You lose:
- Host-level CPU, memory, disk, and network metrics
- Node-level network flow data
- Process-level visibility outside your own containers
- Any tooling that requires privileged access to the host
For most workloads this is an acceptable trade-off, but it’s worth factoring in if you have strict observability or security monitoring requirements that depend on host-level data.
AWS Fargate Pod Evictions
AWS periodically needs to patch and update the underlying infrastructure that runs Fargate micro-VMs. When this happens, AWS will SIGKILL any pod running on the affected node without warning; there is no graceful shutdown, no SIGTERM, no drain. The pod is simply killed.
AWS does publish upcoming eviction events in advance via the AWS Health API, but you have to proactively query for them. If you don’t, you’ll just see pods disappear unexpectedly.
Querying Upcoming Evictions
Upcoming Fargate eviction events can be fetched via the AWS Health API. Note: the Health API is always queried against us-east-1 regardless of where your clusters are:
aws health describe-events \
  --region us-east-1 \
  --filter services=EKS,eventStatusCodes=upcoming,eventTypeCodes=AWS_EKS_FARGATE_POD_EVICTIONS \
  --output json
This returns event ARNs. You then call describe-affected-entities with those ARNs to get the specific pod ARNs in the format:
arn:aws:ecs:<region>/<cluster-name>/<namespace>/<pod-name>
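Splitting that format into its parts needs nothing more than shell parameter expansion. A minimal sketch, assuming the entity really does follow the layout above; parse_entity and the sample value are hypothetical and only there for illustration.

```sh
# parse_entity ENTITY
# Prints "<cluster> <namespace> <pod>" for an entity in the format above.
parse_entity() {
  rest=${1#*/}             # drop everything up to and including the first "/"
  cluster=${rest%%/*}      # text before the next "/"
  rest=${rest#*/}
  namespace=${rest%%/*}
  pod=${rest#*/}           # whatever remains is the pod name
  echo "$cluster $namespace $pod"
}

# Hypothetical entity value, for illustration only
parse_entity "arn:aws:ecs:eu-west-1/prod-cluster/payments/checkout-api-7d9f8c6b5-x2k4q"
# prints: prod-cluster payments checkout-api-7d9f8c6b5-x2k4q
```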
The Mitigation: Daily Proactive Restart
Since evictions are published in advance, the approach is to run a daily scheduled job that:
- Fetches upcoming eviction events from the AWS Health API
- Resolves affected pod names to their parent deployments/rollouts
- Checks whether those pods are still actually running (the eviction may already be resolved)
- Restarts any affected deployments before AWS forcibly kills them
This turns an uncontrolled SIGKILL into a controlled rolling restart with proper drain and health checks.
# GitHub Actions scheduled workflow
on:
  schedule:
    - cron: '30 10 * * *' # Daily at 10:30 UTC
  workflow_dispatch: # Also allow manual trigger
The script logic:
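# 1. Fetch upcoming eviction events (always us-east-1 for Health API)
eventsJSON=$(aws health describe-events \
  --region us-east-1 \
  --filter services=EKS,eventStatusCodes=upcoming,eventTypeCodes=AWS_EKS_FARGATE_POD_EVICTIONS \
  --output json)

# 2. Get affected entities, querying event ARNs in chunks of 10 (API limit).
#    Assumes the pod identifier is returned in each entity's entityValue field.
affectedEntities=$(jq -r '.events[].arn' <<< "$eventsJSON" | xargs -n 10 | tr ' ' ',' |
  while read -r eventArns; do
    [ -n "$eventArns" ] || continue   # nothing upcoming
    aws health describe-affected-entities \
      --region us-east-1 \
      --filter "eventArns=$eventArns" \
      --query 'entities[].entityValue' \
      --output text | tr '\t' '\n'
  done)

# 3. Extract cluster, namespace, pod name from each entity
#    Format: arn:aws:ecs:<region>/<cluster>/<namespace>/<pod>
#    Assumes the current kubectl context already points at $cluster.
restarted=""
for entity in $affectedEntities; do
  cluster=$(cut -d'/' -f2 <<< "$entity")
  namespace=$(cut -d'/' -f3 <<< "$entity")
  pod=$(cut -d'/' -f4 <<< "$entity")
  # Derive deployment name by stripping the ReplicaSet and pod hash suffixes
  serviceName=$(sed -E 's/(.*)-[^-]*-[^-]*$/\1/' <<< "$pod")

  # 4. Cross-reference against currently running pods
  #    (skip if the pod is already gone; eviction may be resolved)
  kubectl get pod "$pod" -n "$namespace" > /dev/null 2>&1 || continue

  # Restart each workload only once, even if several of its pods are affected
  case "$restarted" in *" $namespace/$serviceName "*) continue ;; esac
  restarted="$restarted $namespace/$serviceName "

  # 5. Restart the affected Deployment, falling back to an Argo Rollout
  kubectl rollout restart deployment "$serviceName" -n "$namespace" ||
    kubectl argo rollouts restart "$serviceName" -n "$namespace"
done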
I can’t take credit for this script; thanks go to Nick for working out the details of the Health API queries and the pod-to-deployment logic.
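The pod-to-deployment step is the part most likely to surprise you, so it is worth trying in isolation. The sketch below wraps the same sed expression in a hypothetical deployment_name helper; the pod names are invented for illustration.

```sh
# deployment_name POD_NAME
# Strips the last two dash-separated segments (ReplicaSet hash and pod suffix).
deployment_name() {
  printf '%s\n' "$1" | sed -E 's/(.*)-[^-]*-[^-]*$/\1/'
}

deployment_name "checkout-api-7d9f8c6b5-x2k4q"   # prints: checkout-api
deployment_name "my-db-0"                        # prints: my (wrong for a StatefulSet pod)
```

The second call shows the limit of name-based matching: it only works for pods named <workload>-<hash>-<suffix>. If the cluster runs other workload types on Fargate, reading the pod's ownerReferences is likely a safer way to find the parent.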
Reasons to think twice about Fargate
The eviction problem is one symptom of a broader set of trade-offs. There are a number of reasons to think carefully before adopting Fargate. Personally, I prefer EC2 node groups with Karpenter for most workloads, and would consider Fargate for specific use cases such as cluster-level controllers (Karpenter, AWS Load Balancer Controller, or cert-manager); but for application workloads, EC2 nodes are usually the better choice. The reasons:
- No graceful node drain. On EC2, AWS drains nodes before patching; pods get SIGTERM, drain windows are respected, PodDisruptionBudgets are honoured. On Fargate, none of this applies.
- No pod-level eviction warning. AWS Health API gives advance notice, but the pod itself receives no SIGTERM before the underlying host is replaced.
- Operational overhead to compensate. You need a daily automated job just to avoid random pod deaths. That job has to understand both Deployment and Rollout resource types if you use Argo Rollouts.
- No DaemonSets. Any tooling that runs as a DaemonSet (log shippers, security agents, node exporters) simply doesn’t work on Fargate. You’re forced into sidecar patterns or giving things up entirely.
- Limited host-level metrics. Standard CPU and memory metrics are available via CloudWatch Container Insights, but network throughput, disk IO, and host-level pressure metrics aren’t exposed. You can’t run the node exporter or attach a DaemonSet-based metrics collector.
- ~3x compute cost. Fargate prices CPU and memory at a significant premium over equivalent EC2 capacity. At scale, this adds up fast.
- No instance flexibility. With Karpenter on EC2, instance selection adapts dynamically to your pod requirements; you can mix instance families, spot and on-demand, and bin-pack efficiently. Fargate gives you a fixed vCPU/memory menu with no bin-packing.
- Pod size limits. Fargate caps at 16 vCPU and 120 GB memory per pod. Not a problem for most workloads, but worth knowing.
- Slower cold starts. Fargate provisions a new microVM per pod. Startup latency is higher than scheduling onto a warm EC2 node.
Fargate makes sense for workloads that need strong isolation between tenants and don’t need host-level observability or DaemonSets. For a typical production Kubernetes cluster running persistent services, EC2 node groups with Karpenter give you more control, better observability, and lower cost.