How do we know if a website has gone offline?
Website hosting. It’s a critical part of your digital infrastructure, much like air conditioning in a building. It’s out of sight and typically out of mind, but you very quickly know when it’s not working. At that moment, there’s anxiety and frustration that you could be losing sales, suffering damage to your brand, or worse still, you're a victim of a cyber attack.
We’ve always provided hosting for websites. Most of those sites we’ve built for clients whom we’ve got to know well. So if something goes wrong, it can be relatively easy and quick to identify a problem. But the Internet is so interconnected, it’s not always the website where the fault is.
How do we monitor website hosting?
Irrespective of where our servers are located or who manages the surrounding infrastructure, we have server monitoring attached to every site we’re responsible for. Even if we’re not actually ‘the host’, we’re typically the first point of contact for technical issues. There’s nothing worse for a digital agency that offers hosting than receiving a call to tell you that a site is offline. We pride ourselves on a high level of service, so why should it fall to a client to tell us if their site is offline? It absolutely shouldn’t, and in my view, not having that base covered would be a sign of poor service.
As you might imagine, given the importance of websites and their hosting infrastructure, there are a multitude of services that can measure and monitor uptime.
All our servers (or more correctly, domain names) have a monitor with a service called Pingdom. Pingdom, as the name suggests, ‘pings’ each domain at a given time interval from various locations around the world, checking not only the server’s uptime but also how quickly it responds.
Our monitors are set to consider a site ‘down’ when there’s a timeout for longer than 30 seconds. If Pingdom detects a problem, then it ‘pings’ us to make sure we know about it.
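To make that rule concrete, here is a minimal Python sketch of a single check. It is an illustration only, not how Pingdom works internally. The names `check_site` and `CheckResult` are hypothetical, as is the choice to count an HTTP error status as ‘down’. The only detail taken from our own setup is the 30-second timeout.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional

TIMEOUT_SECONDS = 30  # the timeout after which a site is considered 'down'


@dataclass
class CheckResult:
    up: bool
    response_seconds: Optional[float]  # None when no response arrived at all
    detail: str


def check_site(url: str, timeout: float = TIMEOUT_SECONDS,
               opener=urllib.request.urlopen) -> CheckResult:
    """Request the URL once and report whether it answered in time."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            elapsed = time.monotonic() - started
            return CheckResult(True, elapsed, f"HTTP {response.status}")
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 500.
        # Treating that as 'down' is an assumption made for this sketch.
        return CheckResult(False, time.monotonic() - started, f"HTTP {exc.code}")
    except OSError as exc:
        # No usable answer: DNS failure, refused connection or timeout.
        return CheckResult(False, None, str(exc))
```

Calling `check_site("https://example.com")` (a placeholder address) returns a result holding both the up/down status and the response time, the two things a monitor like this cares about.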
How do you know when a site is offline?
In the moment when a site goes down, clients will often contact us to let us know. That’s totally fine, but we can reassure them all that we have no shortage of alerts to tell us as soon as a problem is identified!
When a domain fails to respond:
Pingdom sends a push notification to my phone
A webhook fires an alert into an #alerts channel on Slack that our entire team can see.
An email is triggered to those on our team with responsibility for investigating the issue.
If a domain is still offline after 5 minutes:
SMS messages are sent to my phone
Emails are triggered into our help desk system (again, notifying everyone on our team)
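The two stages above amount to a simple escalation schedule. As a hedged sketch (the real rules live in Pingdom’s settings, not in our own code), it could be expressed in Python like this. The names `ESCALATION_STEPS` and `channels_alerted` are made up for the example.

```python
from datetime import timedelta

# (time since the first failed check, channels alerted at that point)
ESCALATION_STEPS = [
    (timedelta(minutes=0), ["push notification", "Slack #alerts", "team email"]),
    (timedelta(minutes=5), ["SMS", "help desk email"]),
]


def channels_alerted(downtime: timedelta) -> list[str]:
    """Return every channel that has fired for an outage of this length."""
    fired = []
    for delay, channels in ESCALATION_STEPS:
        if downtime >= delay:
            fired.extend(channels)
    return fired
```

An outage of one minute would only have reached the first three channels, while anything lasting five minutes or more has reached all five.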
So in short, we’re very quickly surrounded by notifications and alerts; it would be almost impossible to ignore them. That’s perhaps overkill, but it’s intentional, to ensure a quick response.
Of course the reverse happens when a site comes back online, although this time each alert is far more satisfying to hear than the earlier ones.
It’s worth adding that it’s not just downtime monitors that can send us alerts. Our backup system is equally well covered, letting us know if and why a backup couldn’t be completed. Similar alerts are triggered if there’s a surge in bandwidth usage on a server, or if disk space hits a certain threshold. We always need to be the first to know about these things, otherwise we can’t exactly call it a managed service.
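A disk space alert is the easiest of these to picture. The sketch below uses Python’s standard `shutil.disk_usage` to measure how full a disk is. The 90% threshold and the function names are assumptions for illustration, not our actual configuration.

```python
import shutil
from typing import Optional

DISK_ALERT_THRESHOLD = 0.90  # hypothetical threshold: alert at 90% full


def disk_usage_fraction(path: str = "/") -> float:
    """Fraction of the disk holding `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_alert(used_fraction: float,
               threshold: float = DISK_ALERT_THRESHOLD) -> Optional[str]:
    """Return an alert message once usage reaches the threshold, else None."""
    if used_fraction >= threshold:
        return f"Disk is {used_fraction:.0%} full (threshold {threshold:.0%})"
    return None
```

Run on a schedule, `disk_alert(disk_usage_fraction("/"))` stays silent until the threshold is crossed, at which point the message can be sent to the same channels as a downtime alert.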
What happens when a website goes offline?
A few things can happen when our alerts start going crazy around us (which thankfully is fairly infrequent).
Often our first thought will be, can this be for real?!
We’ll go and check the site in a browser and see for ourselves what happens. There might be an error message to help us diagnose the problem, or the site could even be online. False positives are not uncommon, depending on where in the world the original alert was triggered. They will usually correct themselves pretty quickly, sometimes showing ‘up’ even before we’ve had a chance to see the ‘down’ alert. As I said, the networks that make up the Internet are complex and run by multiple organisations. ‘Down’ could be the result of a momentary blip caused by any link in the chain that routes traffic to your site.
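One common way to filter out location-specific blips is to treat a site as down only when more than one location agrees. The Python sketch below shows that idea under our own assumptions. It is not a description of how Pingdom decides, and the `confirmed_down` name and the two-failure minimum are hypothetical.

```python
def confirmed_down(results_by_location: dict[str, bool],
                   min_failures: int = 2) -> bool:
    """True when at least `min_failures` locations saw the site as down.

    `results_by_location` maps a location name to True (up) or False (down).
    """
    failures = sum(1 for is_up in results_by_location.values() if not is_up)
    return failures >= min_failures
```

With results of `{"London": True, "New York": False, "Sydney": True}`, the single failure would be treated as a likely false positive rather than an outage.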
If there seems to be an issue, that’s where we need to dig in a little deeper. It could just have been a slow response from the server or the website - both of which can be investigated.
Often the first port of call is to see whether there are any issues with our network providers. If they’re experiencing a problem, then that’s most likely the cause. Most providers have their own ‘live’ status pages, which are easily found with a quick Google search.
If that doesn’t show anything to worry about, we’ll check the status of our servers for things like high memory usage, CPU activity or data read/write rates. These could be triggered by a specific action on the site or a spike in traffic, which can usually be pinpointed pretty quickly.
If not, then we need to turn our attention to the site itself. The error logs will be our next stop, and they take us into specific functionality or sections of code.
So as you can see, it becomes a process of elimination to narrow down the root cause. Pingdom offers a basic insight as to where the chain broke down, but usually, that’s of limited help. Getting into the details of each infrastructure component will be more likely to reveal the issue and determine the most appropriate response.
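That process of elimination can be sketched as an ordered list of checks that stops at the first layer reporting a problem. In practice this is a manual process for us, so the Python below is purely illustrative. The `first_finding` function and the placeholder checks are hypothetical.

```python
from typing import Callable, Optional

# Each check returns a short finding, or None when that layer looks healthy.
Check = Callable[[], Optional[str]]


def first_finding(checks: list[tuple[str, Check]]) -> Optional[tuple[str, str]]:
    """Run the checks in order and return the first (layer, finding) pair."""
    for layer, check in checks:
        finding = check()
        if finding is not None:
            return layer, finding
    return None


# Placeholder checks in the order described above; real ones would query
# a provider status page, server metrics and the site's error logs.
triage_order = [
    ("network provider", lambda: None),
    ("server resources", lambda: "high memory usage"),
    ("site error logs", lambda: None),
]
result = first_finding(triage_order)
```

Here `result` names the server resources layer, so the error logs never need to be opened. Ordering the checks from the outside in means the broadest causes are ruled out first.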
What about out-of-hours responses?
We recognise the Internet is a 24/7 machine and clients are dependent on their services being active every hour of every day. We agree on specific response times for individual clients, but we’d never ignore a message received out of hours until the next business day. Of course, there are times when we need to sleep - we’re a small agency and while our monitoring operates 24 hours a day, we don’t - we can’t.
So there’s a risk of extended downtime through the night (unless we’ve established a separate agreement), but that’s where our careful selection of providers and infrastructure comes into play. I’ve always said ‘cheap hosting is expensive hosting’, so by not being reliant on any single provider, we’ve minimised the risk to an acceptable level. After all, if we can implement this level of monitoring, it’s the very least that we’d expect of our suppliers. If it’s infrastructure related, they should know about a problem before we do, which gives us that peace of mind we can extend to our clients.