Partial +Stream server outage post-mortem - Bytesized Hosting
I wanted to quickly write up the events of yesterday, the 25th of March, that led up to a few hours of downtime for several +Stream servers.
I had been notified quite some time ago that there would be power maintenance in the datacenter hosting the +Stream servers. This maintenance would involve shutting down one of the feeds powering our servers. In theory this should be a zero-downtime endeavour, since all our servers and network equipment are hooked up to two power feeds. What's more, our maintenance window was planned for today, the 26th of March, not the 25th.
I was caught quite off guard when I noticed one of the servers going offline around 20:00. Since I was unable to reach the datacenter myself, I asked for remote hands and eyes to power cycle the server. I quickly received a call saying that both of the server's power connections were hooked up to the same feed (doh), so the server had lost power because of the maintenance. When I brought up that we were not scheduled until the next day, the engineer I spoke to assured me, without going into specifics, that side effects like these could happen.
Around 22:00 GMT+1 even more servers went offline. The engineer had already confirmed that all the other servers had their power hooked up properly, so this meant something else had happened with the power. Luckily it seems the servers had only rebooted, as they were quickly back online. However, at around 01:00 something happened that caused the same servers to shut down without coming back online.
I quickly dispatched another hands and eyes engineer to see what was going on. I received word around 3:31 (the time changed, so it was actually 2:31), and I will quote them: "We checked your rack and found that the circuit breakers on the PDU’s had tripped, we think that it is a coincidence that it happened during the maintenance that we performed on the 25th of March".
At this point I had two separate parties confirm that it was most likely related to the power maintenance, and then one that said it was a total coincidence. The servers have never had any power problems before, and then this happens during the very window in which the power was being worked on. It feels like too much of a coincidence not to be related somehow.
Customer relations has told me they will contact me today to explain what might have happened on their end. If I learn anything more I will update this news post.
If you are still having issues, try a round of app reboots first. If the problems persist, please raise a ticket. We will try to get to you as soon as possible, but please understand that there is quite a backlog.
Sorry for the inconvenience that might have been caused by this outage.