It was 2 AM when my phone started buzzing. Half-asleep, I grabbed it and saw an alert:
High error rate detected. Immediate action required.
I rushed to my laptop and opened the dashboard. Red. Everything was red. Our system, which had been running smoothly for months, was suddenly failing.
I scanned the logs. They made no sense… just a mess of errors and unhelpful messages. Users were flooding support with complaints.
After hours of debugging, we found the problem: one bug had taken down an entire service. We fixed it, deployed the fix, and everything was up and running again.
That night, I realised that we didn’t have good enough observability into our system.