AWS Host Level Degradation Causing Service Interruption
Resolved
Feb 23 at 08:07am CST
The instance became intermittently unresponsive: SSH connections timed out, the EC2 Serial Console failed to connect, and a stop operation was unusually slow to complete. EC2 status checks reported insufficient data for the duration of the event.
Initial investigation, once SSH access was regained, showed no memory pressure, no swap usage, and no disk errors. The CPU credit balance remained healthy at over 500 credits. However, vmstat revealed high CPU steal time, peaking above 50 percent, indicating significant hypervisor-level contention.
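For reference, the "st" column vmstat reports is derived from the steal counter in /proc/stat. A minimal sketch of that calculation, using synthetic sample lines (illustrative numbers, not captured from this event):

```python
# Hedged sketch: compute CPU steal percentage from two /proc/stat "cpu" line
# samples, the same counters vmstat's "st" column is based on. Field layout
# after the "cpu" label: user nice system idle iowait irq softirq steal
# guest guest_nice (all in jiffies).

def steal_percent(sample1: str, sample2: str) -> float:
    """Percent of CPU time stolen by the hypervisor between two samples."""
    def parse(line: str):
        fields = [int(x) for x in line.split()[1:]]
        return fields[7], sum(fields)          # steal jiffies, total jiffies
    steal1, total1 = parse(sample1)
    steal2, total2 = parse(sample2)
    return 100.0 * (steal2 - steal1) / (total2 - total1)

# Synthetic samples taken a fixed interval apart (hypothetical values):
before = "cpu 1000 0 500 8000 100 0 10 200 0 0"
after  = "cpu 1020 0 510 8060 100 0 10 300 0 0"
print(round(steal_percent(before, after)))  # → 53, i.e. over half the interval stolen
```

A sustained value anywhere near this level means the hypervisor is withholding CPU from the guest, which no in-instance tuning can fix.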
No evidence of OOM killer events, filesystem corruption, or application-level resource exhaustion was found. dmesg after reboot confirmed an unclean shutdown but no underlying disk or kernel faults.
Resolution was achieved by stopping and then starting the instance (rather than rebooting), which forces placement onto new underlying AWS hardware. After the restart, CPU steal time returned to zero and system metrics normalized.
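The stop/start remediation can be scripted. A minimal sketch using boto3 (the instance ID and region below are hypothetical; requires AWS credentials, so the call at the bottom is left commented out):

```python
def migrate_instance(instance_id: str, region: str = "us-east-1") -> None:
    """Stop, wait, then start an EC2 instance.

    A stop/start cycle (unlike a reboot) releases the instance from its
    current host, so on start it is typically placed on different hardware.
    """
    import boto3  # imported here so the sketch loads without boto3 installed

    ec2 = boto3.client("ec2", region_name=region)
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

# migrate_instance("i-0123456789abcdef0")  # hypothetical instance ID
```

Note that the public IP changes across a stop/start unless an Elastic IP is attached, so DNS or clients may need updating afterward.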
The root cause appears to be underlying AWS host-level contention or degradation rather than an application or configuration issue.
Affected services