Partial +Stream server outage post-mortem - Part II - Bytesized Hosting
tl;dr: During power maintenance in the datacenter, a power overdraw caused a handful of servers to trip the breaker. The problem has been solved with a temporary workaround and will be fixed permanently in the coming weeks, which requires one more maintenance window for a single server. Everybody affected received one week of free service as compensation for the hours of downtime.
On Friday and Saturday night we had two outage windows that took down a handful of our +Stream servers. I spent most of yesterday in contact with the datacenter engineers to find out why this was happening and what we could do about it, and luckily we found both the problem and the solution. If you are interested in the technical details, read on; otherwise stick with the tl;dr above.
The problems we encountered were very similar to the ones we ran into earlier; see our previous outage post. There was again a maintenance window on the power circuit at our datacenter, one that should not have affected us in any way because every server is powered by two feeds, but again we noticed servers going offline during the window. Last time the datacenter engineers couldn't find the root cause, but yesterday we finally did: when we lose one of the two feeds, the remaining feed goes over its ampere limit and trips the breaker. Normally our cabinet limit is 32 ampere, 16 ampere from the A feed and 16 ampere from the B feed. During maintenance, however, when either feed drops out, the 16 ampere limit is not enough for all the servers connected, and the breaker trips. We tried re-cabling the servers to spread out the load, and it appeared to have worked, but sadly, after an hour on single-feed power, the breaker tripped again. As a temporary workaround we made one of the servers draw its power from our second cabinet, which has plenty of capacity left. In the next few weeks we will have to schedule another maintenance window for one of the servers so we can move it to the other rack permanently and solve these issues properly.
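The failure mode above is just a power-budget calculation. A minimal sketch, with hypothetical per-server amp draws (only the 16 ampere per-feed breaker limit comes from this post):

```python
FEED_LIMIT_AMPS = 16  # each of feed A and feed B is protected at 16 A

# Hypothetical amp draw per server on each feed (not our real numbers).
feed_a_load = [3.2, 2.9, 3.1, 2.8]  # servers cabled to feed A
feed_b_load = [3.0, 2.7, 3.3]       # servers cabled to feed B

def trips_breaker(load_amps, limit=FEED_LIMIT_AMPS):
    """True if the combined draw exceeds the breaker limit."""
    return sum(load_amps) > limit

# Normal operation: each feed carries only its own servers, well under 16 A.
print(trips_breaker(feed_a_load))               # False (12.0 A)
print(trips_breaker(feed_b_load))               # False (9.0 A)

# Maintenance: feed A goes down, feed B must carry everything.
print(trips_breaker(feed_a_load + feed_b_load)) # True (21.0 A > 16 A)
```

This also shows why re-cabling alone couldn't fix it: spreading the load between the feeds doesn't help when one feed must carry the entire cabinet, which is why moving a server to a second cabinet is the lasting fix.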
If our datacenter didn't run these maintenance windows on their power circuit, we would never have found out there was a problem. The chance of a feed losing power is very slim, since all the power is protected by generators; the maintenance window was probably the only way this event could have surfaced, and sadly it did. We should now be protected against further power-related issues, so hopefully the downtime we saw on multiple servers is a thing of the past.
Everybody who was affected by these events will have a free week of service added to their account shortly to make up for the hours of downtime.
Sorry for any inconvenience caused.