
Post-mortem of +Stream downtime

Hey guys,

This news post describes the events that led to the four hours of downtime on our new +Stream servers on the 29th of June 2016. It's fairly technical but it might be interesting to some of you.

tl;dr: Replacing and reconfiguring switches to fix our network issues caused four hours of downtime. Initial reports suggest the network issues have been solved since we made the changes. So yay!

Yesterday we posted our first findings about the network issues we uncovered on our new +Stream servers over the last few days. To try and solve the issues, we went to the datacenter last night to reconfigure our switch together with our upstream provider, which resulted in downtime.

First, some details about the networking setup so the rest of the post makes more sense. We have two networks running through our upstream provider. One is what we call an out-of-band (OOB) network, which exists so we can reach the servers even when the main network is down. The other is our public network, our gateway to the internet.
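To make the split between the two networks a bit more concrete, here is a minimal sketch of how you could check each server over both paths independently. The addresses below are made-up placeholders rather than our actual setup; the script simply pings each server once on its public address and once on its OOB address:

    # Sketch only: check each server over both the public and OOB networks.
    # The hostnames and addresses are placeholders, not our real setup.
    import subprocess

    SERVERS = {
        "stream1": {"public": "203.0.113.10", "oob": "10.10.0.10"},
        "stream2": {"public": "203.0.113.11", "oob": "10.10.0.11"},
    }

    def reachable(address: str) -> bool:
        """Return True if a single ping to the address succeeds."""
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", address],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    for name, networks in SERVERS.items():
        for network, address in networks.items():
            status = "up" if reachable(address) else "DOWN"
            print(f"{name} {network} ({address}): {status}")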

Earlier this week, around the same time we first noticed the disconnect issues, we lost connection to our OOB network. This was annoying since I had done a kernel upgrade that stalled, and without an OOB network that meant driving to the DC to fix it manually. Once there, we noticed the server was stuck configuring the OOB network interface, so we unplugged the server from the OOB network altogether, which got it booting again.

After the network interruptions became more noticeable, and after talking with our upstream provider, we decided to try removing the OOB network configuration from our switch completely, since it looked like all the issues might be coming from there. That was the first thing we did when we entered the datacenter last night: we connected to our switch and removed most of the extra configuration needed for the OOB network while our upstream provider did the same on their end.

However, this made things worse: the servers completely lost their internet connection, something we did not expect. After we had tried every change we could on the switch to get the connection back, our upstream provider suggested the switch itself might be failing. Luckily they had a spare switch that they quickly configured so we could replace our failing one. The whole procedure took an hour or so, but sadly, after doing all that, we still could not get any connection to our servers. Up until that moment our switch had been plugged in via a cross-connect to a switch at our upstream provider. Since replacing our switch did not work, our provider decided to bypass their switch altogether and plug our cross-connect straight into the core router. This did the trick and brought the network back.

We have been watching network performance since yesterday's maintenance and have had zero reported disconnects. It seems the changes we made did solve things in the end, but it's too early to tell for sure.
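For those curious what watching for disconnects looks like in practice, the sketch below shows the general idea: ping each server at a fixed interval and log any failures with a timestamp. The addresses and interval are placeholders, not our actual monitoring setup:

    # Sketch only: periodically ping servers and log disconnects with a timestamp.
    # Addresses and interval are placeholders, not our real monitoring config.
    import subprocess
    import time
    from datetime import datetime

    HOSTS = ["203.0.113.10", "203.0.113.11"]  # example addresses
    INTERVAL_SECONDS = 60

    def ping(host: str) -> bool:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    while True:
        for host in HOSTS:
            if not ping(host):
                print(f"{datetime.now().isoformat()} DISCONNECT {host}")
        time.sleep(INTERVAL_SECONDS)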

We will continue to monitor the situation and will report back once we have confirmed that all the problems have been solved.

Thank you for your understanding, and hopefully it's smooth sailing from here on out!
