Troubleshooting
Focus on structured approaches: check basics first, then deepen.
General troubleshooting flow
- Initial Checks: Resources (CPU, memory, disk), service status, logs, network connectivity.
- Tools: Use
top
orhtop
for real-time monitoring;df -h
for disk space;systemctl status <service>
for services. - Logs:
tail -f /var/log/syslog
orjournalctl -u <service>
for errors. Grep:grep error /var/log/nginx/error.log
. - Log Management: Use
logrotate
to compress/rotate logs. Config in/etc/logrotate.d/nginx
; force withsudo logrotate -f /etc/logrotate.d/nginx
. - Network: Test outbound:
ping 8.8.8.8
. Local access:curl localhost
orcurl -I <url>
. Ports:ss -tuln
.
Scenario 1: Nginx Server Down (Traffic Dead)
- First Steps:
- Check if service is running:
systemctl status nginx
. - Local test:
curl localhost
(confirms if listening on port 80/443).
- Check if service is running:
- If Running:
- Resources:
top
orhtop
for CPU/memory spikes. - Disk:
df -h
(ensure not full). - Logs:
tail -n 200 /var/log/nginx/error.log
.
- Resources:
- Network/Firewall:
- Routes:
ip route
. - Firewall:
iptables -L
(check if port 80 blocked; fix withsudo iptables -A INPUT -p tcp --dport 80 -j ACCEPT
).
- Routes:
- If Local Works but External Fails: Firewall, routing, or security group issue.
Scenario 2: Latency Spiked (CPU Green in CloudWatch)
- First Steps:
- SSH in:
htop
to confirm no hidden spikes (CloudWatch might lag). - Logs:
tail -n 200 /var/log/nginx/error.log
(look for slow backends like DB issues).
- SSH in:
- Deeper Dives:
- Process tracing:
strace -p <PID>
to see what the process is stuck on. - Avoid restarts initially—masks root cause.
- Check for huge logs: Rotate if needed.
- Process tracing:
- Other Causes: Network congestion, external dependencies (e.g., slow API calls).
Scenario 3: EC2 Can't Reach S3 (No Alarms)
- AWS Console/CLI:
- Instance status:
aws ec2 describe-instances --instance-ids <id>
(confirm running). - SSH in: Test internet:
ping google.com
.
- Instance status:
- Network Checks:
- Subnet type: Public (has Internet Gateway route); private needs NAT.
- Security Groups:
aws ec2 describe-security-groups --group-ids <sg-id>
(ensure outbound allowed).
- IAM: Check attached role/policies for S3 access (e.g.,
s3:GetObject
). - If No Internet: Fix routes or add NAT Gateway.
Preparation Tips
- Simulate: Use free tier EC2 or local VM. Practice under time (20-min sessions).
- Mindset: Verbalize steps aloud. Avoid quick reboots—focus on root cause.
- Common Traps: Misconfigured security groups, private subnets without NAT, missing IAM permissions.