
AWS Host-Level Degradation Causing Service Interruption

Feb 23 at 08:07am CST
Affected services
ManageMemberships.com
Door Access
Texts
Emails
Payments
Portals
Sub Systems
Time Clock
ManageRegister
ManageShifts

Resolved
Feb 23 at 08:07am CST

The instance became intermittently unresponsive. SSH access timed out, the EC2 Serial Console failed to connect, and the stop operation was unusually slow. EC2 status checks reported insufficient data for the duration of the event.

Initial investigation after regaining SSH access showed no memory pressure, no swap usage, and no disk errors. The CPU credit balance remained healthy at over 500 credits. However, vmstat revealed high CPU steal time, peaking above 50 percent, indicating significant hypervisor-level contention.
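
For reference, a minimal sketch of how the steal metric can be sampled directly, using the same /proc/stat counter that vmstat's "st" column is derived from (assumes a Linux guest; the 5-second window is arbitrary):

    import time

    def cpu_counters():
        # First line of /proc/stat: cpu user nice system idle iowait irq softirq steal ...
        with open("/proc/stat") as f:
            values = [int(v) for v in f.readline().split()[1:]]
        # Total jiffies (first 8 fields; guest time is already counted in user)
        # and the steal jiffies (8th field).
        return sum(values[:8]), values[7]

    total_a, steal_a = cpu_counters()
    time.sleep(5)
    total_b, steal_b = cpu_counters()

    steal_pct = 100.0 * (steal_b - steal_a) / (total_b - total_a)
    print(f"CPU steal over 5 s sample: {steal_pct:.1f}%")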

No evidence of OOM-killer events, filesystem corruption, or application-level resource exhaustion was found. dmesg output after the reboot confirmed an unclean shutdown but no underlying disk or kernel faults.
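
A sketch of that post-reboot log check, scanning the kernel ring buffer for OOM-killer and disk fault markers (the marker strings are illustrative, not exhaustive):

    import subprocess

    MARKERS = ("Out of memory", "oom-killer", "I/O error", "EXT4-fs error")

    # Read the kernel ring buffer; requires dmesg to be runnable by this user.
    log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout

    hits = [line for line in log.splitlines() if any(m in line for m in MARKERS)]
    print("\n".join(hits) or "no OOM or disk fault markers found")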

Resolution was achieved by stopping and starting the instance, forcing placement on new AWS hardware. After the restart, CPU steal time returned to zero and system metrics normalized.
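
A stop/start cycle, unlike a reboot, releases the instance from its current host, so it typically comes back on different underlying hardware. A minimal sketch of that remediation using boto3; the instance ID and region below are placeholders, not the actual values:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    instance_id = "i-0123456789abcdef0"  # hypothetical

    # Full stop (not reboot) detaches the instance from its current host.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # Starting again places it on new underlying hardware.
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    print(f"{instance_id} restarted on new hardware")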

The root cause appears to be underlying AWS host-level contention or degradation rather than anything application or configuration related.