Mastering Kubernetes Resource Requests and Limits for Peak Performance

Kubernetes resource requests and limits look simple in YAML, but they shape almost every day-two behavior in a cluster: where pods land, which workloads get evicted first, whether an app gets CPU-throttled under load, and how much spare capacity you think you have. A bad setting can make a healthy application look broken. A missing setting can make the scheduler pack pods onto a node until the first traffic spike turns into a noisy-neighbor incident.

The part that catches teams is that requests and limits are used by different systems. Requests are mainly a scheduling promise. Limits are an enforcement boundary. Treating them as the same thing leads to strange outcomes, especially with CPU.

Understanding the Core Concepts: Requests vs. Limits

In Kubernetes, a container can define expected resource consumption using resources.requests and resources.limits. They are not technically mandatory unless your cluster uses policies such as LimitRange or admission controls, but production workloads should usually define at least requests for CPU and memory. Without requests, the scheduler has little useful information and the pod falls into a weaker quality-of-service posture.
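
As a rough illustration of the LimitRange policy mentioned above, the sketch below fills in defaults for containers that omit their own values. The object name, the namespace, and every quantity are assumptions chosen for the example, not recommendations.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults   # hypothetical name
  namespace: development     # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:        # used when a container declares no request
        cpu: "100m"
        memory: "128Mi"
      default:               # used when a container declares no limit
        memory: "256Mi"      # no CPU entry here, so no CPU limit is injected
```

Defaults like these are a safety net for forgotten settings. They do not replace per-workload values based on measurement.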

1. Resource Requests (`requests`)

Requests are the amount of each resource the kube-scheduler reserves for a container when it places the Pod. The scheduler looks only at these declared values, not at live usage, when deciding which node a Pod fits on.

Scheduling: A node must have enough unreserved allocatable capacity, meaning its allocatable resources minus the requests of the Pods already assigned to it, to cover the new Pod's requests before the Pod can be scheduled there.
Runtime priority: Requests influence CPU shares and memory eviction decisions. They are not a magic reservation that prevents every slowdown, but they do give Kubernetes and the kernel better information during contention.

2. Resource Limits (`limits`)

Limits define the maximum amount of resources a container is allowed to consume. Exceeding these limits results in specific, defined behaviors for CPU and Memory.

CPU Limits: If a container attempts to use more CPU than its limit, the Linux kernel's cgroup bandwidth control throttles it: the container's threads are paused until the next enforcement period begins. The container keeps running, just more slowly.
Memory Limits: If a container exceeds its memory limit, the kernel can terminate a process in the container. Kubernetes reports this as OOMKilled when that is the recorded termination reason.
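
To make the throttling behavior concrete, here is a small worked example written as comments on a limits fragment. It assumes the default 100ms CFS period. The 500m figure and the four-thread scenario are illustrative only.

```yaml
resources:
  limits:
    cpu: "500m"
    # With the default 100ms CFS period, 500m becomes a quota of
    #   0.5 CPU x 100ms = 50ms of CPU time per period.
    # cgroup v1 exposes this as cpu.cfs_quota_us=50000 and cpu.cfs_period_us=100000;
    # cgroup v2 exposes it as "50000 100000" in cpu.max.
    #
    # The quota is shared by all threads in the container. Four busy threads
    # spend the 50ms in about 12.5ms of wall-clock time (50ms / 4), and the
    # container is then paused for the remaining 87.5ms of that period.
```

This is why a multi-threaded service can show latency spikes with a CPU limit that looks generous on an average-usage graph.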

CPU vs. Memory Behavior

CPU and memory limits are enforced in qualitatively different ways:

Resource	Behavior on Exceeding Limit	Enforcement Mechanism
CPU	Throttled (slowed down)	cgroups (cpu bandwidth control)
Memory	Terminated (OOMKill)	Kernel OOM Killer

Practical caution: CPU limits can protect a node from a runaway process, but they can also create latency problems when set too low. Many platform teams set memory limits consistently and are more selective with CPU limits for latency-sensitive services, depending on their risk tolerance and cluster policy.

Defining Resources in Pod Specifications

Resources are defined within the spec.containers[*].resources block. Quantities are specified using standard Kubernetes suffixes (e.g., m for milli-CPU, Mi for Mebibytes).

CPU Unit Definitions

1 CPU unit equals 1 full core (or vCPU on cloud providers).
1000m (millicores) equals 1 CPU unit.

Memory Unit Definitions

Mi (Mebibytes) or Gi (Gibibytes) are common.
1024Mi = 1Gi.
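
The fragment below shows equivalent ways of writing the same quantities, plus one suffix pitfall. The values themselves are arbitrary and only there to illustrate the notation.

```yaml
resources:
  requests:
    cpu: "0.25"      # same quantity as "250m"
    memory: "1Gi"    # same quantity as "1024Mi" (2^30 bytes)
  limits:
    cpu: "1500m"     # same quantity as "1.5"
    memory: "2G"     # decimal suffix: 2 x 10^9 bytes, slightly less than "2Gi"
```

Suffix case matters for memory. A lowercase m means "milli", so a memory value such as "400m" asks for 0.4 bytes rather than 400 mebibytes.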

Example YAML Configuration

Consider a container that requires a guaranteed minimum of 500m CPU and 256Mi of memory, but should never exceed 1 CPU and 512Mi:

resources:
  requests:
    memory: "256Mi"
    cpu: "500m"
  limits:
    memory: "512Mi"
    cpu: "1"
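
For context, this is roughly where that fragment sits in a complete Pod manifest. The Pod name, container name, and image are placeholders invented for the sketch.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-example                        # hypothetical name
spec:
  containers:
    - name: api                            # hypothetical container
      image: registry.example.com/api:1.0  # placeholder image
      resources:
        requests:
          memory: "256Mi"
          cpu: "500m"
        limits:
          memory: "512Mi"
          cpu: "1"
```

In a Deployment, the same resources block appears under spec.template.spec.containers[*].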

The numbers should come from observed behavior, not guesswork. For a small HTTP API, a starting request might be based on normal p50 or p90 usage during business traffic, then adjusted after load testing. For a JVM service, memory needs to include heap, metaspace, native memory, thread stacks, direct buffers, and sidecar overhead. For a batch job, peak memory during the largest input may matter more than average memory.
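
For the JVM case, one possible approach is to cap the heap as a percentage of the container's memory limit so that the non-heap areas have room. The sketch assumes a HotSpot-based, container-aware JVM that honors JAVA_TOOL_OPTIONS and -XX:MaxRAMPercentage. The service name, image, and the 70% figure are illustrative assumptions.

```yaml
containers:
  - name: orders-api                              # hypothetical service
    image: registry.example.com/orders-api:1.0    # placeholder image
    env:
      - name: JAVA_TOOL_OPTIONS
        # Heap may grow to about 70% of the 1Gi limit (roughly 716Mi),
        # leaving the rest for metaspace, thread stacks, and direct buffers.
        value: "-XX:MaxRAMPercentage=70.0"
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        memory: "1Gi"
```

The right percentage depends on how much native memory the service uses, so it is worth confirming under load rather than copying.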

Quality of Service (QoS) Classes

The relationship between Requests and Limits determines the Quality of Service (QoS) class assigned to a Pod. This class dictates the Pod's priority when resources become scarce and the node needs to reclaim memory (eviction).

Kubernetes defines three QoS classes:

1. Guaranteed

Definition: All containers in the Pod must have identical, non-zero Requests and Limits for both CPU and Memory.

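Benefit: These Pods are the least likely to be evicted under node resource pressure, provided they stay within their own limits. A Guaranteed container that exceeds its memory limit is still OOM-killed.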
Use Case: Critical system components or databases requiring strict performance isolation.
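
A minimal sketch of a resources block that produces the Guaranteed class, with arbitrary quantities. Every container in the Pod would need a block of this shape.

```yaml
resources:
  requests:
    cpu: "1"
    memory: "2Gi"
  limits:
    cpu: "1"        # equal to the CPU request
    memory: "2Gi"   # equal to the memory request
```

If only limits are written, Kubernetes sets the requests equal to them, which also yields Guaranteed.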

2. Burstable

Definition: The Pod does not meet the Guaranteed criteria, and at least one container has a CPU or memory request or limit. Typical cases are requests set lower than limits, or a request with no limit for some resource (setting at least a memory limit is usually recommended).

Benefit: Allows containers to burst beyond their requests, utilizing unused capacity on the node, up to their defined limits.
Eviction Priority: Evicted before BestEffort Pods, but after Guaranteed Pods.
Use Case: Most standard stateless applications where slight variation in latency is acceptable.
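
A Burstable container might look like the sketch below: requests are set, and the limits sit above them. The quantities are illustrative.

```yaml
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1"         # may burst to 1 CPU when the node has spare cycles
    memory: "512Mi"  # may grow to 512Mi before being OOM-killed
```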

3. BestEffort

Definition: The Pod has no Requests or Limits defined for any container.

Benefit: None, other than simplicity.
Risk: These Pods are the first candidates for eviction when the node experiences memory pressure. They can also compete poorly for CPU because no request has been declared.
Use Case: Non-critical batch jobs or logging agents that can easily be restarted.
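
For comparison, a BestEffort Pod is simply one in which no container declares any resources. The names and image below are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scratch-job                             # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: worker                              # hypothetical container
      image: registry.example.com/worker:1.0    # placeholder image
      # No resources block at all, so the Pod is classed as BestEffort.
```

Kubernetes records the assigned class in the Pod's status.qosClass field. Running kubectl get pod scratch-job -o jsonpath='{.status.qosClass}' is a quick way to confirm which class a Pod actually received.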

Practical Optimization Strategies

Effective resource management requires measurement, iteration, and careful planning.

Strategy 1: Measure and Set Requests Accurately

Requests should reflect the amount of resource the application needs to run acceptably most of the time. If you set requests too high, you waste cluster capacity because the scheduler treats that capacity as already spoken for. If you set them too low, the scheduler may place too many pods on a node, and the workload may be more likely to suffer during contention.

Use monitoring tools such as Prometheus and Grafana to compare request values against real usage. A common starting point is to look at several days of normal traffic, ignore obvious one-off incidents, and set requests near a sustained percentile rather than the single highest spike. The exact percentile is a policy choice; the main point is to use data and revisit it.
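
One way to turn "several days of normal traffic" into numbers is a pair of Prometheus recording rules, sketched below. It assumes the cluster scrapes the cAdvisor metrics container_cpu_usage_seconds_total and container_memory_working_set_bytes. The group name, rule names, 7-day window, and 90th percentile are assumptions that a team would adjust to its own policy.

```yaml
groups:
  - name: rightsizing              # hypothetical group name
    interval: 1h                   # long-window queries are costly; evaluate rarely
    rules:
      # 90th percentile of per-container CPU usage (in cores) over 7 days
      - record: container:cpu_usage_cores:p90_7d
        expr: |
          quantile_over_time(
            0.90,
            rate(container_cpu_usage_seconds_total{container!=""}[5m])[7d:5m]
          )
      # Highest memory working set per container over 7 days
      - record: container:memory_working_set_bytes:max_7d
        expr: |
          max_over_time(container_memory_working_set_bytes{container!=""}[7d])
```

The CPU series is a candidate input for requests, while the memory peak is more relevant to limits. Both still need a sanity check against known incidents in the window.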

For example, if a service normally uses 180m CPU, peaks around 450m during deploy warm-up, and has rare spikes to 900m during a known batch task, setting a 900m request may waste capacity all day. Setting a 50m request may make the pod cheap to schedule but unstable under contention. A request around the normal sustained range, plus separate handling for the batch path, is often a better conversation.

Strategy 2: Define Conservative Limits

Limits act as a safety boundary, but they are not free. For memory, a limit slightly above measured peak usage can prevent one container from consuming the node. For CPU, a limit prevents a runaway process from using unlimited CPU, but aggressive limits can throttle a service even when the node has idle cores.

Warning on CPU Limits: Setting CPU limits below actual demand can cause visible latency through throttling. Burstable QoS is a reasonable fit for many stateless services, while Guaranteed QoS is better reserved for workloads where the isolation tradeoff is intentional.
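
One pattern that follows from this warning is a Burstable container with a firm memory boundary and no CPU limit. The sketch below uses arbitrary quantities, and it only applies where cluster policy permits omitting CPU limits.

```yaml
resources:
  requests:
    cpu: "300m"       # scheduling reservation and CPU weight under contention
    memory: "512Mi"
  limits:
    memory: "512Mi"   # memory is capped at the request
    # No cpu limit: the container may use idle node CPU without CFS throttling.
```

The tradeoff is weaker isolation: a runaway process can consume spare CPU on the node, although other containers keep their request-based share under contention.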

Strategy 3: Leveraging Vertical Pod Autoscaler (VPA)

Manually tuning resources is difficult and time-consuming. The Vertical Pod Autoscaler (VPA) monitors runtime usage and can recommend or update resource requests, depending on its mode. In many setups, VPA is first used in recommendation mode so teams can review suggested requests before allowing automatic updates.
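
A recommendation-only setup could look like the following sketch. It assumes the VPA components and CRDs are installed in the cluster (they are not part of core Kubernetes), and the object names and namespace are hypothetical.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa              # hypothetical name
  namespace: development     # hypothetical namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                # hypothetical Deployment
  updatePolicy:
    updateMode: "Off"        # compute recommendations, never change Pods
```

With updateMode set to "Off", suggested values appear under the object's status.recommendation and can be reviewed with kubectl describe vpa api-vpa before anyone applies them.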

Be careful when combining VPA with Horizontal Pod Autoscaler. HPA often scales based on utilization relative to requests, so changing requests can change scaling behavior. It can work well, but it should be tested deliberately.
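
The interaction is easier to see with a concrete HPA. The sketch below scales on CPU utilization, which is measured relative to requests. The names, replica bounds, and 70% target are assumptions for illustration.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa              # hypothetical name
  namespace: development     # hypothetical namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                # hypothetical Deployment
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request
          # A pod using 140m with a 200m request is at 70%.
          # If the request is raised to 400m, the same 140m reads as 35%,
          # so the HPA would scale out much later.
```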

Strategy 4: Resource Quotas for Namespaces

To prevent one team or environment from consuming a shared cluster, administrators can define a ResourceQuota in each namespace. A ResourceQuota caps the aggregate CPU and memory requests and limits of all Pods in that namespace, so no single namespace can claim more than its share.

Example Namespace Quota

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: development
spec:
  hard:
    requests.cpu: "10"
    limits.memory: "20Gi"

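This caps the total requested CPU across all Pods in the development namespace at 10 cores and the total memory limits at 20Gi. Once a quota tracks a value this way, Pods in the namespace must declare it (here, a CPU request and a memory limit) or their creation may be rejected, which is one reason quotas are often paired with a LimitRange that supplies defaults.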

How Bad Settings Show Up in Real Clusters

Resource mistakes rarely announce themselves as "resource mistakes." They usually look like application incidents.

If CPU limits are too low, you may see high request latency while CPU usage graphs appear capped. The container wants more CPU, but cgroup throttling holds it back. In Prometheus-based setups, metrics such as container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total can help show whether throttling is part of the story.
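
The two metrics are most useful as a ratio: the share of CFS periods in which the container was throttled. The alerting rule below is a sketch of that idea. The alert name, the 25% threshold, and the 15-minute hold time are assumptions to be tuned per cluster.

```yaml
groups:
  - name: cpu-throttling                     # hypothetical group name
    rules:
      - alert: ContainerCPUThrottlingHigh    # hypothetical alert name
        expr: |
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
          )
          /
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_periods_total{container!=""}[5m])
          )
          > 0.25
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Container throttled in more than 25% of CFS periods"
```

A high ratio alongside rising latency is a stronger signal than either one alone, since some throttling is normal for bursty workloads.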

If memory limits are too low, the pod may restart with Reason: OOMKilled. The application logs may end abruptly because the process did not get a graceful shutdown. kubectl describe pod usually tells the truth faster than the app log in this case.
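
The same information is available in structured form from kubectl get pod <name> -o yaml. The excerpt below shows the shape of the relevant status fields. The container name and restart count are made up for illustration.

```yaml
status:
  containerStatuses:
    - name: api                  # hypothetical container
      restartCount: 3            # illustrative value
      lastState:
        terminated:
          reason: OOMKilled
          exitCode: 137          # 128 + 9: the process received SIGKILL
```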

If requests are too high, pods may sit in Pending even though dashboards show average node usage is low. The scheduler does not place pods based on average actual usage; it compares requested resources with allocatable resources. This is why a cluster can look underused and still refuse a new pod.
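
A small worked example, written as comments around a node's status.allocatable block, shows how this happens. All numbers are hypothetical.

```yaml
# Excerpt of a node object (illustrative values)
status:
  allocatable:
    cpu: "4"          # 4000m available to Pods
    memory: "16Gi"
# Pods already on the node: 3 x 1200m CPU requested = 3600m reserved.
# Unreserved CPU: 4000m - 3600m = 400m.
# A new Pod requesting 500m does not fit (500m > 400m) and stays Pending,
# even if those three Pods are actually using only 800m in total.
```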

If requests are missing, the workload may look fine during calm periods and then become the first thing squeezed when the node is busy. That can be acceptable for disposable jobs, but it is a poor default for user-facing services.

A Safer Tuning Workflow

A practical tuning loop looks like this:

Start with explicit CPU and memory requests for every production container, including sidecars.
Set memory limits based on observed peak plus headroom, then watch for OOMKilled restarts.
Decide whether CPU limits are required by policy or workload risk. If you use them, monitor throttling.
Compare requested CPU and memory with actual usage weekly or monthly, especially after major releases.
Treat right-sizing as change management. A lower request can increase bin-packing density, but it can also change failure behavior during node pressure.

Sidecars deserve attention. A service mesh proxy, log shipper, or security agent can consume enough CPU or memory to change the pod's real footprint. If only the main app container gets tuned, the pod can still be misrepresented to the scheduler.
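
As a sketch of what tuning both containers looks like, the Pod spec fragment below gives the application and a proxy sidecar their own requests. The container names, images, and quantities are invented for the example.

```yaml
spec:
  containers:
    - name: app                                   # hypothetical main container
      image: registry.example.com/app:1.0         # placeholder image
      resources:
        requests:
          cpu: "300m"
          memory: "512Mi"
        limits:
          memory: "768Mi"
    - name: mesh-proxy                            # hypothetical sidecar
      image: registry.example.com/proxy:1.0       # placeholder image
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
        limits:
          memory: "256Mi"
# The scheduler sums the containers' requests:
# 300m + 100m = 400m CPU and 512Mi + 128Mi = 640Mi memory for this Pod.
```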

Example: Fixing a Latency Spike Caused by CPU Throttling

Imagine an API container with this configuration:

resources:
  requests:
    cpu: "100m"
    memory: "256Mi"
  limits:
    cpu: "200m"
    memory: "512Mi"

During a sale event, latency jumps. The node still has idle CPU, but the container is capped at 200m. Raising replicas may help, but each pod is still individually throttled. A better fix might be to raise or remove the CPU limit, increase the request to match normal sustained demand, and use HPA so the service scales before latency gets ugly.
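
A revised block along those lines might look like this. The 300m request is a hypothetical stand-in for "normal sustained demand", which would have to come from the service's own metrics, and dropping the CPU limit assumes cluster policy allows it.

```yaml
resources:
  requests:
    cpu: "300m"       # raised from 100m; hypothetical measured sustained usage
    memory: "256Mi"
  limits:
    memory: "512Mi"   # unchanged
    # CPU limit removed so the container can use idle node CPU during spikes.
```

Because HPA utilization is measured against the request, changing the request from 100m to 300m also changes when the autoscaler reacts, so the HPA target should be reviewed in the same change.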

The important lesson is that low CPU usage in the graph does not always mean the app is idle. It may mean the app is not allowed to use more.

Final Check

Requests tell Kubernetes how to place and prioritize the pod. Limits tell the kernel where to stop it. Good values come from real metrics, load tests, and a clear decision about what matters more for each workload: density, isolation, latency, or cost. Revisit the values after traffic changes, dependency changes, and runtime upgrades. Stale resource settings are one of the quietest ways a cluster drifts into poor performance.