Optimizing Linux Network Throughput by Tuning TCP/IP sysctl Parameters
Practical Linux TCP sysctl tuning for throughput, buffers, congestion control, and safe testing.
Linux network throughput tuning starts with a boring question: what is actually slow? A server that tops out at 300 Mbps on a 10 Gbps link may have a TCP window problem, a disk problem, a CPU interrupt problem, a bad virtual NIC setting, a packet loss problem, or an application that sends data in tiny chunks. sysctl tuning only helps with some of those.
That is why I treat TCP/IP sysctl changes as controlled experiments, not magic performance recipes. Start with a baseline, change one small group of settings, test again, and keep notes. If you copy a giant tuning block from the internet into /etc/sysctl.conf, you may improve one workload and quietly hurt another.
The settings below are useful when you run high-throughput services: artifact repositories, backup servers, object storage gateways, busy reverse proxies, database replicas shipping large logs, or Linux hosts moving traffic across long-distance links. They are less likely to help if your bottleneck is TLS encryption CPU, slow storage, application locking, cloud provider limits, or packet loss outside the host.
Before changing anything, collect a quick baseline:
ip -s link
ss -s
nstat -az | grep -E 'TcpRetransSegs|TcpExtTCPLoss|TcpExtTCPTimeouts|TcpExtListenOverflows'
sar -n DEV,TCP,ETCP 1 10
iperf3 -c test-host -P 4 -t 30
If you see retransmits climbing, fix loss before raising buffers. If CPU is already pinned in top, mpstat, or perf, sysctls may hide the symptom but not remove the bottleneck. If iperf3 is fast but your app is slow, look at the app path before tuning the kernel.
How sysctl Fits Into Network Tuning
sysctl exposes kernel parameters while the system is running. Network settings usually live under net.ipv4, net.ipv6, and net.core. You can read a value like this:
sysctl net.ipv4.tcp_congestion_control
sysctl net.ipv4.tcp_rmem
sysctl net.core.rmem_max
You can test a temporary change like this:
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
Temporary changes disappear after reboot. Persistent changes belong in a dedicated file such as /etc/sysctl.d/90-network-throughput.conf, not scattered through /etc/sysctl.conf with no explanation.
sudo install -m 0644 /dev/null /etc/sysctl.d/90-network-throughput.conf
sudoedit /etc/sysctl.d/90-network-throughput.conf
sudo sysctl --system
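Use a separate file because rollback is simple: move the file away, then set the affected keys back to their previous values with sudo sysctl -w, or reboot. Removing the file and running sudo sysctl --system is not enough on its own, because that command only applies what the remaining files contain and leaves every other key as it currently is in the running kernel. That matters when a setting behaves badly under production traffic.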
TCP Buffers: Give Long Connections Room to Breathe
The first place people look is buffer sizing. TCP needs enough send and receive window space to keep data in flight while acknowledgments travel across the network. The useful mental model is bandwidth-delay product: a high-bandwidth, high-latency connection needs more outstanding data than a low-latency LAN connection.
For example, a 1 Gbps transfer across a 1 ms data center path needs far less in-flight data than a 1 Gbps transfer across a 70 ms WAN path. If the receive window is too small, the sender pauses even though the link has room.
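To put numbers on that comparison, here is a minimal sketch of the bandwidth-delay product arithmetic. The function name bdp_bytes is my own, and it assumes whole-number inputs in Mbps and milliseconds.

```shell
# bdp_bytes MBPS RTT_MS
# Prints the bandwidth-delay product in bytes:
#   (MBPS * 1,000,000 / 8) bytes per second * (RTT_MS / 1000) seconds
# which simplifies to MBPS * RTT_MS * 125.
bdp_bytes() {
    echo $(( $1 * $2 * 125 ))
}

# 1 Gbps across a 1 ms data center path
bdp_bytes 1000 1     # 125000 bytes
# 1 Gbps across a 70 ms WAN path
bdp_bytes 1000 70    # 8750000 bytes, roughly 8.3 MiB
```

Treat the result as a floor, not a target. The kernel counts its own bookkeeping overhead against the socket buffer, so the buffer generally needs to be somewhat larger than the raw BDP.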
Linux uses three-value arrays for TCP memory tuning:
net.ipv4.tcp_rmem = 4096 131072 33554432
net.ipv4.tcp_wmem = 4096 131072 33554432
The three numbers are minimum, default, and maximum per-socket buffer sizes in bytes. The exact values should match your workload, memory budget, and kernel behavior. The example above raises the maximum to 32 MiB, which is often enough for busy servers without being reckless. Some long-haul or storage-heavy systems use larger values, but that should be tested with real traffic.
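The net.core limits cap the buffer sizes an application can request explicitly with SO_RCVBUF and SO_SNDBUF:
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
These caps do not limit TCP autotuning, which is bounded by the third value of tcp_rmem and tcp_wmem. They matter when an application sets its own buffer size: a request above net.core.rmem_max is silently clamped, and setting SO_RCVBUF also turns off receive autotuning for that socket. Keep the caps aligned with the TCP maximums unless you have a reason not to, so applications that size their own buffers are not held far below what autotuned sockets can reach.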
Do not raise buffers blindly on a machine with many concurrent connections. A file server with a few large flows can afford bigger per-flow buffers. A proxy handling hundreds of thousands of connections can burn memory quickly if you make every socket eligible for huge buffers.
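A rough upper bound makes the difference concrete. This sketch multiplies a connection count by a per-socket maximum. The function name worst_case_mib and the connection counts are illustrative assumptions, and real usage is normally far lower because buffers only grow when a flow needs them and the kernel limits total TCP memory through net.ipv4.tcp_mem.

```shell
# worst_case_mib CONNECTIONS MAX_BUFFER_BYTES
# Upper bound in MiB if every connection grew one buffer to the maximum.
# Assumes 64-bit shell arithmetic.
worst_case_mib() {
    echo $(( $1 * $2 / 1048576 ))
}

# A file server with 50 large flows and a 32 MiB ceiling
worst_case_mib 50 33554432        # 1600 MiB
# A proxy with 200000 connections and the same ceiling
worst_case_mib 200000 33554432    # 6400000 MiB, far beyond any real host
```

The second number will never be reached in practice, but it shows why a ceiling that is harmless on a file server deserves a memory-pressure test on a busy proxy.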
Autotuning Is Already Doing Some Work
Modern Linux kernels already autotune TCP buffers. That means you usually do not need to set huge fixed socket buffers in the application. The kernel grows buffers when a connection benefits from more space.
Your job is mostly to make sure the ceiling is not too low. If throughput is poor on a long fat network and ss -tin shows small receive windows or a sender blocked by the receiver, raising tcp_rmem, tcp_wmem, rmem_max, and wmem_max can help.
Check active connections with:
ss -tin dst <peer-ip>
Look for fields such as cwnd, rtt, rto, bytes_acked, bytes_received, and retransmission counters. They tell a better story than a single speed test.
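One way to read those fields together is to estimate how fast a connection could send with its current congestion window: cwnd times mss, divided by rtt. The sketch below assumes the usual ss -tin layout of space-separated key:value pairs with rtt given in milliseconds as rtt:average/variance. The function name cwnd_rate_mbps is hypothetical, and the figure is a snapshot, not a measurement.

```shell
# cwnd_rate_mbps
# Reads ss -tin output on stdin and prints, for each line that carries
# mss, cwnd, and rtt, the window-limited rate in Mbps:
#   cwnd (segments) * mss (bytes) * 8 bits / rtt (ms), scaled to Mbps.
cwnd_rate_mbps() {
    awk '{
        mss = ""; cwnd = ""; rtt = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^mss:/)  { split($i, a, ":");    mss  = a[2] }
            if ($i ~ /^cwnd:/) { split($i, a, ":");    cwnd = a[2] }
            if ($i ~ /^rtt:/)  { split($i, a, "[:/]"); rtt  = a[2] }
        }
        if (mss != "" && cwnd != "" && rtt + 0 > 0)
            printf "%.1f\n", cwnd * mss * 8 / (rtt * 1000)
    }'
}

# Usage sketch (not run here):
#   ss -tin dst <peer-ip> | cwnd_rate_mbps
```

If the estimate sits far below the link rate while retransmits stay flat, the window is the likely limit. If cwnd keeps collapsing between samples, look at loss first.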
Congestion Control: CUBIC, BBR, and Reality
The congestion control algorithm decides how TCP grows or shrinks its sending rate. On many Linux systems, CUBIC is the default and works well for general internet and data center traffic. BBR can improve throughput and latency on some lossy or long-distance paths because it models bottleneck bandwidth and round-trip time instead of reacting only to packet loss.
Check available algorithms:
sysctl net.ipv4.tcp_available_congestion_control
sysctl net.ipv4.tcp_congestion_control
Enable BBR only if your kernel has it available:
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
For persistence:
net.ipv4.tcp_congestion_control = bbr
Pair BBR with the fair queuing packet scheduler where you can. Kernels before 4.13 depended on fq for pacing; newer kernels can pace inside TCP, but fq remains a common pairing:
net.core.default_qdisc = fq
Do not assume BBR is always faster. It can change fairness with other flows, and different BBR versions behave differently across kernels. Test it on the same traffic pattern you care about: many small API calls, a few bulk transfers, replicated database traffic, or mixed production-like load.
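The availability check can be scripted so a deploy step never requests an algorithm the kernel does not list. This is a minimal sketch; the helper name has_cc is my own, and it reads the same value that sysctl net.ipv4.tcp_available_congestion_control prints. The list only shows algorithms currently registered, so on some systems the tcp_bbr module has to be loaded before bbr appears.

```shell
# has_cc NAME "LIST"
# Succeeds if NAME appears as a whole word in the space-separated LIST.
has_cc() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

cc_file=/proc/sys/net/ipv4/tcp_available_congestion_control
if [ -r "$cc_file" ] && has_cc bbr "$(cat "$cc_file")"; then
    echo "bbr is available"
else
    echo "bbr is not listed; keeping the current algorithm"
fi
```

The block only reports; it does not change the running algorithm. Put the sysctl -w call inside the first branch once you have tested BBR on your own traffic.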
Listen Queues: Fix Drops Before They Become Mysteries
Throughput problems sometimes show up as connection failures during traffic spikes. If a service accepts new TCP connections slower than clients create them, the kernel queues fill up.
Relevant settings:
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 8192
somaxconn caps the completed connection backlog requested by applications through listen(2); since Linux 5.4 the default is already 4096, up from 128 on older kernels. tcp_max_syn_backlog affects half-open SYN queue capacity. Raising them can help busy web servers, proxies, and load balancers, but the application must also request a large enough backlog. Nginx, HAProxy, Envoy, and application servers often have their own backlog settings.
Watch for overflows:
nstat -az | grep -E 'ListenOverflows|ListenDrops|Syncookies'
ss -ltn
If ListenOverflows climbs, the accept queue is filling faster than the application calls accept(). If CPU is saturated or the app is blocked on downstream services, increasing queue sizes may reduce client errors briefly but will not fix the service.
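For listening sockets, ss -ltn reports the current accept queue length in Recv-Q and the configured backlog limit in Send-Q, so a nearly full queue is visible before the overflow counters move. The sketch below flags listeners above a threshold. The function name full_listeners and the 80 percent default are assumptions for illustration.

```shell
# full_listeners [PERCENT]
# Reads ss -ltn output on stdin and prints listeners whose accept queue
# (Recv-Q, column 2) is at or above PERCENT of the backlog limit
# (Send-Q, column 3). PERCENT defaults to 80.
full_listeners() {
    awk -v pct="${1:-80}" '
        $1 == "LISTEN" && $3 > 0 && $2 * 100 >= $3 * pct {
            print $4, "queue", $2 "/" $3
        }'
}

# Usage sketch (not run here):
#   ss -ltn | full_listeners 80
```

A single sample can miss short bursts, so run it in a loop during a traffic spike and compare the output with the ListenOverflows counter.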
Backlog and Packet Processing
net.core.netdev_max_backlog controls how many packets can wait on the input queue when the kernel receives packets faster than it can process them.
net.core.netdev_max_backlog = 250000
This can help on high-speed interfaces during bursts, especially with virtualized networking. It can also add latency if you turn the host into a large packet waiting room. Check interface drops first:
ip -s link show dev eth0
ethtool -S eth0 | grep -E 'drop|err|timeout|miss|fifo'
If driver-level drops are climbing, also inspect NIC ring sizes, interrupt distribution, RSS queues, and CPU affinity. Those are outside sysctl, but they often matter more than TCP buffers on 10 Gbps and faster hosts.
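The kernel also keeps its own count of packets dropped because the input backlog was full. It is the second hexadecimal column of /proc/net/softnet_stat, one row per CPU. The sketch below decodes that column. The function name softnet_drops is my own, and the row index is not guaranteed to equal the CPU number on every kernel.

```shell
# softnet_drops [FILE]
# Prints the backlog drop counter (second column, hexadecimal) for each
# row of FILE, which defaults to /proc/net/softnet_stat.
softnet_drops() {
    file=${1:-/proc/net/softnet_stat}
    row=0
    while read -r processed dropped rest; do
        echo "row $row dropped $(( 0x$dropped ))"
        row=$(( row + 1 ))
    done < "$file"
}

# Only read the live counters when the file exists (Linux hosts).
if [ -r /proc/net/softnet_stat ]; then
    softnet_drops
fi
```

If these counters stay at zero during your bursts, raising netdev_max_backlog is unlikely to change anything, and the drops you see are happening somewhere else.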
TIME_WAIT and Port Exhaustion
High-throughput clients, proxies, and job runners may run out of ephemeral ports or accumulate many sockets in TIME_WAIT. Be careful here because old tuning advice can be harmful.
Check the current range:
sysctl net.ipv4.ip_local_port_range
ss -tan state time-wait | wc -l
A reasonable client-side adjustment is widening the ephemeral port range:
net.ipv4.ip_local_port_range = 10240 60999
Avoid old advice that recommends tcp_tw_recycle; it was removed in Linux 4.12 because it broke valid traffic, especially behind NAT. tcp_tw_reuse exists on many kernels and applies only to outgoing connections, but its behavior has changed over time. Do not enable it as a default throughput setting. If you think you need it, test your exact kernel and traffic pattern carefully.
For servers, a pile of TIME_WAIT sockets is often normal. For clients, port exhaustion usually means you need connection pooling, keep-alive, HTTP/2, fewer short-lived outbound connections, or more source IPs.
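Some quick arithmetic shows how much a wider range buys. This sketch counts the ports in a range and divides by the 60 seconds Linux keeps a closed connection in TIME_WAIT, which gives a rough ceiling on new connections per second to a single destination IP and port. The function names are hypothetical, and the estimate ignores tcp_tw_reuse and reserved ports.

```shell
# port_count LOW HIGH
# Number of ephemeral ports in the inclusive range LOW..HIGH.
port_count() {
    echo $(( $2 - $1 + 1 ))
}

# max_new_conns_per_sec LOW HIGH
# Rough ceiling on short-lived connections per second to one destination,
# assuming each port is unavailable for 60 seconds after close.
max_new_conns_per_sec() {
    echo $(( ($2 - $1 + 1) / 60 ))
}

port_count 32768 60999              # 28232, the common default range
port_count 10240 60999              # 50760, the widened range
max_new_conns_per_sec 32768 60999   # 470
max_new_conns_per_sec 10240 60999   # 846
```

Widening the range not quite doubles the ceiling. Connection reuse removes the limit entirely, which is why pooling and keep-alive are the better fix.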
A Conservative Starting File
Here is a practical starting point for a high-throughput server. It is intentionally not extreme:
# /etc/sysctl.d/90-network-throughput.conf
# Larger TCP autotuning ceilings for high-bandwidth or higher-latency paths.
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 131072 33554432
net.ipv4.tcp_wmem = 4096 131072 33554432
# Larger queues for bursty inbound traffic.
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 8192
net.core.netdev_max_backlog = 250000
# Optional: test before enabling globally.
# net.core.default_qdisc = fq
# net.ipv4.tcp_congestion_control = bbr
Apply it:
sudo sysctl --system
Then measure again. Use the same test size, time window, number of parallel streams, and network path. A before-and-after test that changes five variables is not evidence.
Common Mistakes
The first mistake is tuning on a lossy path. TCP sees loss as congestion. Bigger buffers may raise throughput a little, but they can also increase latency and hide the real problem. Fix bad cables, overloaded virtual switches, packet policing, MTU mismatch, and flaky VPN paths first.
The second mistake is assuming iperf3 -P 8 proves application performance. Parallel streams can fill a link even when one real application connection cannot. That is useful information, but it is not the whole story.
The third mistake is setting enormous buffers on shared hosts. Bigger ceilings are fine when the kernel grows buffers only as needed, but memory pressure changes everything. Monitor free, slabtop, the TCP line in /proc/net/sockstat, and application memory after changes.
The fourth mistake is forgetting rollback. Keep the previous values in your change ticket or runbook:
sysctl net.core.rmem_max net.core.wmem_max net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.ipv4.tcp_congestion_control
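If you want the saved values in a form you can load straight back, one option is to write them as key = value lines. The sketch below reads /proc/sys directly, so it also works where the sysctl binary is missing. The function name snapshot_sysctls and the output file name are assumptions.

```shell
# snapshot_sysctls ROOT KEY...
# Prints "key = value" for each KEY found under ROOT (normally /proc/sys),
# in a format that sysctl -p can load later. Unreadable keys are skipped.
snapshot_sysctls() {
    root=$1
    shift
    for key in "$@"; do
        path="$root/$(echo "$key" | tr . /)"
        if [ -r "$path" ]; then
            # /proc/sys separates multi-value settings with tabs; use spaces.
            echo "$key = $(tr '\t' ' ' < "$path")"
        fi
    done
}

# Usage sketch (not run here):
#   snapshot_sysctls /proc/sys net.core.rmem_max net.core.wmem_max \
#       net.ipv4.tcp_rmem net.ipv4.tcp_wmem > sysctl-before.conf
#   sudo sysctl -p sysctl-before.conf    # roll back to the saved values
```

Attach the snapshot to the change ticket so the rollback does not depend on anyone's memory of what the defaults were.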
When sysctl Is Not the Fix
If one CPU core is pegged while others are idle, look at interrupt handling, RSS, RPS/XPS, and application threading. If disk wait is high, the network may be waiting on storage. If TLS consumes CPU, test with and without encryption and consider hardware, cipher choice, or connection reuse. If Kubernetes or a cloud load balancer sits in the path, check service-level limits and conntrack tables.
For NAT-heavy hosts, also inspect conntrack:
sysctl net.netfilter.nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max
That is not TCP throughput tuning, but conntrack exhaustion can look like random network slowness or dropped connections.
Testing Without Fooling Yourself
Use iperf3 as a network tool, not as proof that the application is fixed. A single stream test is useful because it shows what one TCP connection can do:
iperf3 -c test-host -t 30
A parallel test shows whether the link can be filled by multiple flows:
iperf3 -c test-host -P 8 -t 30
If parallel streams are fast but one stream is slow, look at congestion control, TCP window growth, RTT, and packet loss. If both are slow, look lower: interface errors, cloud bandwidth limits, CPU saturation, MTU, firewall inspection, or storage behind the sender and receiver.
Keep the test path realistic. Testing two hosts in the same rack will not tell you much about a backup job crossing regions. Testing with a tiny file will not expose steady-state throughput. Testing through a VPN may measure the VPN appliance more than Linux TCP.
After each change, capture the same counters:
nstat -az > /tmp/nstat-after.txt
ss -s
sar -n DEV,TCP,ETCP 1 10
The useful result is not just "the number went up." You want to know whether retransmits fell, queues stopped overflowing, CPU stayed reasonable, and latency did not become worse for smaller requests.
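Comparing two saved nstat snapshots turns that into a number per counter. This sketch assumes both files came from nstat -az, where each line is a counter name followed by its absolute value. The function name counter_delta and the before-file path are hypothetical.

```shell
# counter_delta NAME BEFORE_FILE AFTER_FILE
# Prints how much counter NAME grew between two nstat -az snapshots.
# A counter missing from a file is treated as 0.
counter_delta() {
    v_before=$(awk -v n="$1" '$1 == n { print $2 }' "$2")
    v_after=$(awk -v n="$1" '$1 == n { print $2 }' "$3")
    echo $(( ${v_after:-0} - ${v_before:-0} ))
}

# Usage sketch (not run here):
#   counter_delta TcpRetransSegs /tmp/nstat-before.txt /tmp/nstat-after.txt
#   counter_delta TcpExtListenOverflows /tmp/nstat-before.txt /tmp/nstat-after.txt
```

Take both snapshots around test runs of the same length, otherwise a longer run will show more retransmits simply because it sent more data.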
Good Linux network tuning is measured and reversible. Raise TCP buffer ceilings when the path needs larger windows. Test congestion control instead of assuming one algorithm wins everywhere. Increase listen queues when you see queue drops, and fix the application if it cannot accept fast enough. sysctl is useful, but it is one layer in a bigger system.