Troubleshooting Common EC2 Instance Connectivity Issues and Errors
Learn to rapidly diagnose and fix common EC2 connectivity failures for SSH and RDP. This practical guide walks you through checking instance health, verifying crucial Security Group rules, troubleshooting stateless Network ACLs, and confirming VPC routing configurations to restore immediate access to your instances.
When an EC2 connection fails, the first useful question is whether the instance is unreachable, rejecting authentication, or reachable only through the wrong path. Whether you are using SSH for Linux instances or Remote Desktop Protocol (RDP) for Windows instances, connectivity failures are common and often frustrating. SSH and RDP errors tend to blur together, but Permission denied, Connection timed out, Connection refused, and a blank RDP screen point to different layers. Treat the error text as a clue, then work from the outside inward.
Phase 1: Initial Checks and Instance Health
Before diving into complex network configurations, ensure the instance is running correctly and reachable at a fundamental level.
1. Instance Status Checks
Use the AWS Management Console or the AWS CLI to verify the instance's overall health. Two crucial checks must pass:
- System Status Checks: Failures here usually indicate problems with the underlying host, such as hardware, power, or network faults. AWS must repair these, or you can sidestep them by moving the instance to a new host.
- Instance Status Checks: Failures here often relate to operating system boot issues, file system corruption, or driver problems. If this check fails, the instance is often too unhealthy to accept network connections.
Action: If the system status check fails, stop and start the instance (a reboot is not enough), which typically moves an EBS-backed instance to a new host. Data on instance store volumes does not survive a stop. If the instance status check fails, read the System Log for boot errors before changing anything else.
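The decision above can be sketched as a small function. This is a minimal illustration, not an official AWS tool: the function name triage_status_checks is hypothetical, and the inputs are assumed to be the SystemStatus.Status and InstanceStatus.Status strings reported by describe-instance-status.

```python
def triage_status_checks(system_status: str, instance_status: str) -> str:
    """Suggest a next step from the two EC2 status check results.

    The inputs are assumed to mirror the SystemStatus.Status and
    InstanceStatus.Status values from describe-instance-status,
    for example "ok", "impaired", or "initializing".
    """
    if system_status == "impaired":
        # Host-level problem: a stop/start usually lands on a new host.
        return "stop and start the instance to move it to a new host"
    if instance_status == "impaired":
        # Guest-level problem: the OS did not come up cleanly.
        return "read the system log for boot, file system, or driver errors"
    if system_status == "ok" and instance_status == "ok":
        return "instance is healthy; continue with address and firewall checks"
    # Any other value means the checks have not produced a verdict yet.
    return "wait for the status checks to finish, then re-evaluate"
```

Checking the system status first matters because a host-level fault can also cause the instance check to fail.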
2. Verifying the Public IP Address and DNS Name
Ensure you are attempting to connect to the correct address. If your instance must be reached directly from the internet, it needs a Public IPv4 Address or an Elastic IP and a public subnet route through an internet gateway. If it's in a private subnet, you must connect via a Bastion Host or use AWS Systems Manager Session Manager.
- Tip: If the instance was stopped and started, its public IP address may have changed unless you assigned an Elastic IP.
3. Checking Client Configuration (SSH/RDP)
Connectivity errors are sometimes local. Verify that your client software is functioning correctly.
- For SSH (Linux/macOS): Ensure you are using the correct private key file (.pem for OpenSSH, .ppk for PuTTY) and that the permissions are correctly set (chmod 400 /path/to/key.pem).
- For RDP (Windows): Ensure you are using the correct password, obtained by decrypting the administrator password in the EC2 console with the private key file.
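OpenSSH refuses a private key that group or other users can access, which is why chmod 400 matters. A minimal sketch of that check on a POSIX system follows; the function name key_permissions_too_open is hypothetical.

```python
import os
import stat


def key_permissions_too_open(path: str) -> bool:
    """Return True if group or other users have any access to the key file."""
    mode = os.stat(path).st_mode
    # S_IRWXG | S_IRWXO covers every group and other permission bit (0o077).
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))
```

A file with mode 400 or 600 passes this check, while the common default of 644 does not.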
Phase 2: Security Layers Diagnostics (The Most Common Failures)
Security misconfigurations are the leading cause of connectivity problems. Both Security Groups and Network ACLs act as firewalls, and both must permit the necessary traffic.
4. Security Group (SG) Ingress Rules
Security Groups are stateful firewalls attached directly to the instance's Elastic Network Interface (ENI).
Linux (SSH) Requirements:
- Protocol: TCP
- Port Range: 22
- Source: Your public IP address (My IP) or 0.0.0.0/0 (for all IPs, though this is discouraged for security).
Windows (RDP) Requirements:
- Protocol: TCP
- Port Range: 3389
- Source: Your public IP address or 0.0.0.0/0.
Troubleshooting Step: Temporarily change the source of the required ingress rule to 0.0.0.0/0 for the relevant port (22 or 3389). If you can connect, the issue was that your specific client IP address was blocked or not correctly identified.
Warning: Never leave security groups open to 0.0.0.0/0 for management ports (22/3389) in production environments. Use specific source IPs or VPC endpoints where possible.
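To see why a rule scoped to the wrong source address silently blocks you, it can help to model the match yourself. The sketch below assumes a simplified, hypothetical rule shape (from_port, to_port, cidr) rather than the full EC2 API structure, and covers TCP only.

```python
import ipaddress


def ingress_allows(rules, client_ip: str, port: int) -> bool:
    """Return True if any rule admits TCP traffic from client_ip to port.

    Each rule is a hypothetical simplified dict, for example:
    {"from_port": 22, "to_port": 22, "cidr": "203.0.113.0/24"}
    """
    address = ipaddress.ip_address(client_ip)
    for rule in rules:
        network = ipaddress.ip_network(rule["cidr"])
        in_port_range = rule["from_port"] <= port <= rule["to_port"]
        # Security groups contain allow rules only, so one match is enough.
        if in_port_range and address in network:
            return True
    return False
```

If your current public IP falls outside every listed CIDR, the function returns False, which corresponds to the timeout you see from the client.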
5. Network ACLs (NACLs)
Network ACLs are stateless, subnet-level firewalls. They check both inbound and outbound traffic independently. If traffic is allowed in, the return traffic must also be allowed out.
NACL Requirements for Connectivity:
| Direction | Protocol | Port Range | Rule Action |
|---|---|---|---|
| Inbound | TCP | 22 (SSH) or 3389 (RDP) | Allow |
| Outbound | TCP | Ephemeral Ports (1024-65535) | Allow |
Ephemeral ports are critical. When your client connects, it picks a high-numbered ephemeral source port (for example, 54321). The server replies from port 22 or 3389 to that ephemeral port, so the reply leaves the subnet as outbound traffic destined for a high port. If the NACL blocks outbound traffic to these ports, the response never reaches you and the connection times out.
Troubleshooting Step: Verify that both the inbound port (22/3389) and the outbound ephemeral ports (1024-65535) have an Allow rule in the associated NACL.
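The stateless behavior can be illustrated with a short model. NACL rules are evaluated in ascending rule-number order, the first match decides, and anything unmatched is denied. The tuple layout and function names below are hypothetical simplifications (TCP only, ports only, no CIDR matching).

```python
def nacl_allows(rules, port: int) -> bool:
    """Evaluate rules in ascending rule-number order; the first match wins.

    Each rule is a hypothetical tuple:
    (rule_number, from_port, to_port, action), with action "allow" or "deny".
    No match falls through to the default deny that ends every NACL.
    """
    for _number, from_port, to_port, action in sorted(rules):
        if from_port <= port <= to_port:
            return action == "allow"
    return False


def round_trip_allowed(inbound, outbound, server_port: int, client_port: int) -> bool:
    """A stateless NACL must pass the request and, separately, the reply."""
    request_ok = nacl_allows(inbound, server_port)  # client -> port 22 or 3389
    reply_ok = nacl_allows(outbound, client_port)   # server -> client ephemeral port
    return request_ok and reply_ok
```

With an inbound allow on port 22 but an outbound list that only allows 443, round_trip_allowed returns False for a client on port 54321, which matches the timeout described above.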
Phase 3: Routing and VPC Configuration
If the security layers are confirmed open, the next suspect is how traffic is routed to and from the instance's subnet.
6. Subnet Type and Route Tables
Connectivity depends entirely on whether your instance is in a Public Subnet or a Private Subnet.
Public Subnet Connectivity
For direct internet access (SSH/RDP from the outside world):
- The instance must be assigned a Public IPv4 address or Elastic IP.
- The associated Route Table must have a route for 0.0.0.0/0 pointing to an Internet Gateway (IGW).
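The route table requirement can be checked mechanically. The sketch below assumes each route is a dict using the DestinationCidrBlock, GatewayId, and State keys as they appear in describe-route-tables output; the function name is hypothetical, and only the IPv4 default route is considered.

```python
def has_internet_gateway_default_route(routes) -> bool:
    """Return True if an active IPv4 default route targets an internet gateway."""
    for route in routes:
        is_default = route.get("DestinationCidrBlock") == "0.0.0.0/0"
        is_active = route.get("State", "active") == "active"
        # Internet gateway IDs start with "igw-". NAT gateways are listed
        # under a different key (NatGatewayId), so they do not match here.
        targets_igw = route.get("GatewayId", "").startswith("igw-")
        if is_default and is_active and targets_igw:
            return True
    return False
```

A subnet whose default route points at a NAT gateway fails this check, which is expected: that subnet can reach the internet outbound but cannot accept direct inbound SSH or RDP.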
Private Subnet Connectivity
Instances in private subnets cannot be reached directly from the internet. Connection requires a multi-hop path:
- Connection via Bastion Host (Jump Box): You SSH into a public EC2 instance, and then SSH from the Bastion Host to the private instance (using its Private IP).
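- Connection via VPN/Direct Connect: If you use AWS Site-to-Site VPN or Direct Connect, the subnet's route table must send traffic for your on-premises address ranges to the virtual private gateway or transit gateway, and your on-premises network must route the VPC address range back over the same link.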
7. OS-Level Firewall Issues
If AWS security checks pass, the operating system running on the EC2 instance itself might be blocking the connection. This is common if you manually installed or configured local firewalls (like iptables on Linux or Windows Defender Firewall).
Diagnosis (If possible via Console or Session Manager):
- Linux: Check iptables -L or use firewall-cmd --list-all. Ensure port 22 is explicitly allowed.
- Windows: Check Windows Defender Firewall settings for inbound rules on port 3389.
Recovery Tip: If you have lost all connectivity, consider stopping the instance, detaching the root volume, attaching it to a functioning recovery instance in the same Availability Zone, modifying the OS configuration files to disable the firewall, and then reattaching the volume to the original instance as its root device.
Public, private, and managed connection options
Do not assume every EC2 instance should accept SSH or RDP from the internet. Public instances need a public address, a route to an internet gateway, permissive security controls, and a running listener. Private instances usually need a different access method: a bastion host, VPN, Direct Connect, EC2 Instance Connect Endpoint, or Systems Manager Session Manager.
Session Manager is especially useful for operations teams because it can remove the need for inbound SSH. The instance needs the SSM agent, an IAM instance profile with the right Systems Manager permissions, and network access to SSM endpoints. In private subnets, that usually means VPC interface endpoints or outbound internet through a NAT path. If any of those pieces are missing, the instance will not register as a managed node and Session Manager cannot connect to it, even though the instance itself is healthy.
For a bastion design, test both legs. First connect from your workstation to the bastion. Then connect from the bastion to the private IP of the target instance. The target instance security group should usually allow SSH only from the bastion security group, not from your home IP and not from the whole VPC CIDR unless you have a reason.
For RDP, remember that Windows boot can take longer than Linux SSH startup, especially after patching or first launch. If the instance status checks have just passed but RDP still fails, check the system log and wait a few minutes before changing firewall rules. Repeatedly replacing security groups can hide the actual boot or service issue.
Quick tests from your workstation
Use small network tests before changing AWS resources. From Linux or macOS, nc -vz <public-ip> 22 tests whether TCP port 22 completes. For RDP, use nc -vz <public-ip> 3389 or a port test tool from Windows. A timeout points toward routing, security groups, NACLs, or an upstream firewall. A refused connection points more toward the instance OS or service.
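Where nc is not available, the same timeout-versus-refused distinction can be made with a few lines of standard-library Python. This is a sketch under the assumption that one TCP connect attempt is representative; the function names are hypothetical.

```python
import socket


def classify_connect_error(error) -> str:
    """Map the outcome of a TCP connect attempt to a likely problem layer."""
    if error is None:
        return "open: the path and the listener both work"
    if isinstance(error, ConnectionRefusedError):
        return "refused: check the instance OS firewall or the service"
    if isinstance(error, (socket.timeout, TimeoutError)):
        return "timeout: check routing, security groups, NACLs, or an upstream firewall"
    return "other: " + str(error)


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt one TCP connection and classify the result."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return classify_connect_error(None)
    except OSError as error:
        return classify_connect_error(error)
```

Calling probe_tcp with your instance address and port 22 or 3389 gives the same signal as the nc test, and name resolution failures surface under "other".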
If DNS is involved, resolve it explicitly:
dig +short ec2-203-0-113-10.compute-1.amazonaws.com
Then compare the result with the current public IP in the EC2 console. Elastic IPs stay stable, but auto-assigned public IPs can change after stop/start. This is a simple cause of broken runbooks and saved RDP profiles.
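Default EC2 public DNS names embed the IPv4 address they were issued for, so a saved name can be checked against the current address without a lookup. The sketch below assumes the default ec2-a-b-c-d naming pattern; the function names are hypothetical, and custom DNS names are out of scope.

```python
import re


def ip_from_ec2_public_dns(hostname: str):
    """Extract the IPv4 address embedded in a default EC2 public DNS name.

    Returns None if the name does not start with the ec2-a-b-c-d pattern.
    """
    match = re.match(r"^ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.", hostname)
    return ".".join(match.groups()) if match else None


def dns_name_is_stale(hostname: str, current_public_ip: str) -> bool:
    """True when the saved name encodes a different IP than the instance has now."""
    embedded = ip_from_ec2_public_dns(hostname)
    if embedded is None:
        raise ValueError("not a default EC2 public DNS name: " + hostname)
    return embedded != current_public_ip
```

Running this over the hostnames in a runbook or saved RDP profiles is a quick way to spot entries that predate a stop/start.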
If you use a corporate VPN, test from another network before editing the VPC. Some company networks block outbound SSH or RDP, and some home routers or ISPs interfere with uncommon ports. A successful connection from a different network tells you the instance may be fine.
VPC Reachability Analyzer is worth using when the route is not obvious. It can model a path between a source and destination and point out where routing or filtering blocks traffic. It will not fix a bad SSH key or a stopped service inside the guest OS, but it is helpful for separating AWS network design problems from operating system problems.
Flow logs can also help, especially when NACLs or security groups are suspect. A rejected flow from your client IP to port 22 or 3389 tells you the packet reached a monitored network interface or subnet and was denied. No flow at all may mean the traffic never reached the VPC, the address is wrong, or you are looking at the wrong ENI, subnet, or time window.
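As an illustration, the following sketch filters flow log lines for rejected SSH or RDP attempts from one client. It assumes the default (version 2) flow log format with its 14 space-separated fields; the function names are hypothetical, and custom log formats would need a different field list.

```python
# Field order of the default (version 2) VPC flow log record.
FIELDS = ["version", "account_id", "interface_id", "srcaddr", "dstaddr",
          "srcport", "dstport", "protocol", "packets", "bytes",
          "start", "end", "action", "log_status"]


def parse_flow_record(line: str) -> dict:
    """Split one default-format flow log line into a field dict."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError("unexpected field count: %d" % len(parts))
    return dict(zip(FIELDS, parts))


def rejected_management_flows(lines, client_ip: str, ports=(22, 3389)):
    """Return records where client_ip was rejected on a management port."""
    wanted_ports = {str(port) for port in ports}
    hits = []
    for line in lines:
        record = parse_flow_record(line)
        if (record["action"] == "REJECT"
                and record["srcaddr"] == client_ip
                and record["dstport"] in wanted_ports):
            hits.append(record)
    return hits
```

An empty result for the right ENI and time window supports the "traffic never arrived" interpretation, while a hit confirms that a security group or NACL denied the packet.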
Keep a small access runbook for each environment: approved source IP ranges, bastion name, SSM requirements, default usernames by AMI, and the recovery instance procedure. Connectivity incidents get slower when every engineer has to rediscover those details from the console.
Also record which subnets are intentionally private. That single note prevents a lot of wasted debugging when someone tries to SSH directly to an instance that was never designed to have an internet path.
Reading the error message
Connection timed out usually means packets are not completing the trip. Look at public IP, route table, internet gateway, security group source, NACL rules, corporate firewall, and whether you are trying to reach a private subnet directly.
Connection refused usually means the network path reached the instance, but nothing is listening on that port or the OS rejected it. On Linux, check whether sshd is running and listening on port 22. On Windows, check whether RDP is enabled and the Remote Desktop service is running.
Permission denied (publickey) is not a VPC problem. It usually means the wrong username, wrong private key, missing public key in authorized_keys, changed home directory permissions, or an AMI username mismatch such as using ec2-user for an Ubuntu image instead of ubuntu.
For Windows RDP, authentication failures often come from using an old decrypted administrator password after the instance was replaced, connecting to the wrong public IP after a stop/start, or domain policy overriding local login rights.
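The mapping from error text to layer can be written down as a small lookup, which is handy in a runbook or a wrapper script. This is a rough sketch: the name likely_layer and the substring matching are illustrative assumptions, and real client messages vary between tools and versions.

```python
# Ordered (substring, layer) pairs; matching is case-insensitive.
ERROR_LAYERS = [
    ("permission denied", "authentication: username, private key, or authorized_keys"),
    ("connection timed out", "network path: address, routes, security group, NACL, or upstream firewall"),
    ("connection refused", "instance OS: local firewall, or sshd/RDP service not listening"),
]


def likely_layer(message: str) -> str:
    """Return the layer that a client error message most likely points to."""
    lowered = message.lower()
    for needle, layer in ERROR_LAYERS:
        if needle in lowered:
            return layer
    return "unknown: capture verbose client output (for example, ssh -v)"
```

For example, passing an OpenSSH "Permission denied (publickey)" line returns the authentication layer, so there is no reason to edit VPC settings for that failure.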
Recovery paths when you cannot log in
If the instance has the Systems Manager agent installed, an instance profile with SSM permissions, and network access to SSM endpoints or the internet, Session Manager is usually the least disruptive recovery path. You can inspect logs, fix firewall rules, or repair authorized_keys without opening SSH to the world.
If SSM is unavailable, use EC2 serial console where supported, or detach the root volume and attach it to a recovery instance in the same Availability Zone. Mount it carefully, fix the network or SSH configuration, unmount it, and reattach it to the original instance. Take a snapshot first so a repair attempt does not make recovery worse.
When connectivity fails, follow this prioritized checklist: instance health, correct address, correct username/key or RDP password, security group, NACL, route table, OS firewall, and service health. That order keeps you from changing five AWS controls when the actual problem is one stale key or one missing route.