Post-mortem: API outage
We are really sorry for the outage we had today. Our team is working hard to address the cause of this problem and to fix the things that did not work the way they were supposed to. Let us take you through what happened, why it happened, and what we’re going to do to prevent this in the future.
AWS Load balancer connections dropped
As you may know, your Homey Pro is connected to our cloud servers so that you can control your Homey when you’re not on your local network. Problems started at 03:00 UTC, when all load balancer connections from the Homeys to our servers were suddenly dropped. All Homeys then started reconnecting to the servers simultaneously (see graph).
For every Homey that connects, we need to check which Homey it is by validating it against the database through an API call. With every Homey reconnecting at once, this caused a massive load on the API server and on the database. We scaled up automatically, but even that was not enough: once requests slow down and retries start coming in, things get worse quickly. In the end, we resolved the situation by manually scaling up even further, both horizontally and vertically. This released some of the pressure, and things started returning to normal.
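The retry pile-up described above is the usual motivation for exponential back-off with jitter, one of the fixes listed further down. As a rough illustration only, here is a minimal Python sketch of the idea. It is not the actual Homey Pro implementation; the function names, the 1 second base and the 300 second cap are all assumptions made for the example.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=300.0):
    """Return a random delay in seconds before retry number `attempt` (0-based).

    The upper bound doubles with every attempt until it reaches `cap`.
    Picking a random point below that bound ("full jitter") spreads clients
    out in time, so they do not all reconnect at the same moment.
    The default values are illustrative assumptions.
    """
    ceiling = min(cap, base * 2 ** attempt)
    return random.uniform(0.0, ceiling)


def connect_with_backoff(connect, max_attempts=8, sleep=time.sleep):
    """Call `connect()` until it succeeds, waiting longer after each failure.

    `connect` is a hypothetical callable that raises ConnectionError on failure.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller decide what to do
            sleep(backoff_delay(attempt))
```

With these example values, a device that keeps failing would wait at most 1, 2, 4, 8 seconds and so on between attempts, and the randomization means that many devices that lost their connection at the same instant would come back spread over that window rather than all at once.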
Local connectivity
As you probably know, our advertising message clearly says: ‘Locally connected’. It is one of the pillars on which we build Homey: privacy, security and locally connected. Unfortunately, this pillar failed us. When you opened the mobile app this morning, chances were high that you saw an error message (Homey Offline or API timeout). We were aware of this problem in earlier releases of the app, and we thought we had tackled it completely in our latest major release, version 8.0.0. However, we overlooked the fact that when the API server goes offline, our weather service experiences a specific error that the mobile app failed to handle correctly.
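The general pattern behind this kind of bug is that a failure in a non-essential cloud feature ends up blocking an essential local one. The following Python sketch shows one common way to guard against that, under the assumption that weather data is optional for rendering the app. It does not reflect the real app code; `fetch_devices_locally` and `fetch_weather` are hypothetical callables invented for the example.

```python
def load_home_screen(fetch_devices_locally, fetch_weather):
    """Build the data for the home screen.

    Local device data is essential, so errors there are allowed to propagate.
    Weather is treated as optional: if the cloud call fails for any reason,
    the screen is still returned, just without weather information.
    """
    devices = fetch_devices_locally()
    try:
        weather = fetch_weather()
    except Exception:
        # Deliberately broad: no weather error should take the whole app down.
        weather = None
    return {"devices": devices, "weather": weather}
```

In a sketch like this, a timeout from the weather call results in a screen with `weather` set to `None` instead of an error page, while the locally fetched devices remain usable.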
Problem summary
- AWS drops connections (still investigating why this happens)
- No exponential back-off for reconnecting in Homey Pro
- No rate-limit on connecting Homey Pros
- Bug in v8 app when API server is down
Immediate actions taken
- Add exponential backoff to Homey Pro
- Add rate-limit to our connect servers
- Fix bug in v8 app
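To give an idea of what a rate-limit on the connect servers can look like, here is a small token bucket sketch in Python. The token bucket is a common, generic technique and is used here purely as an illustration; the source does not say which algorithm or limits are used, and the class name, `rate` and `capacity` values are assumptions.

```python
import time


class TokenBucket:
    """Allow short bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum number of stored tokens
        self.clock = clock          # injectable clock, handy for testing
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True if a new connection may proceed right now."""
        now = self.clock()
        # Refill according to the time elapsed since the last call.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A server using something like this would reject or delay a connection attempt whenever `allow()` returns `False`, so that a sudden wave of reconnects reaches the API server and database at a bounded rate. Combined with back-off on the device side, rejected devices retry later instead of immediately.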
Long term actions
- Add more test cases to the mobile app for local connectivity
- Improve customer communication in case issues like this arise
Again, we are really sorry this happened and we are doing everything we can to prevent issues like this from happening again!