How higher traffic and sales exposed hidden performance issues and bugs
Before we dive into this article it is important to note that we never write poorly optimised code intentionally. No matter how good your work is, there are always new methods to consider and code can slow down within a short period of time. With that in mind, let’s get to it!
2020 has been a very strange year for everyone and some sites have seen a huge uplift in traffic as a result of this. When sites get a lot more traffic to them, it can expose some weaknesses in the codebase which you weren’t previously aware existed. The code you wrote is fine and it still works, but it’s probably a few years old since the site went live. New features may have been added over time but the core logic often remains unchanged. Sometimes if it’s not broken, it’s best not to fix it, but not always!
When you get a sustained period of higher than usual traffic and the site starts to slow down, it’s always worth investigating why. I’m going to take you through a scenario we encountered this year and the changes we made to rectify it. By the way, the answer isn’t to just upgrade the server. When you upgrade the server and you’re still hitting issues, it’s better to investigate why and not keep throwing more resources at the problem.
Context is key
First of all I’m going to set the scene for you. We have a customer with a site which provides online bookings, and you can also purchase on site by just walking up on the day. Suddenly the human malware struck at the start of the year and the world was turned upside down. We used the lockdown period to overhaul the booking system for this customer. Instead of a ticket being valid for a day, it was now valid for a specific time slot, even down to a colour coded block of seats. When lockdown started to be lifted around the beginning of July, walk up bookings were no longer allowed and all ticket purchases had to be made through the website.
During lockdown we had been running test bookings, data exports, etc and making sure everything worked as needed before going live with the updates to the system. The new process was ready to go live, but at its core it still used the same old custom logic in some areas. This would have worked under normal traffic levels, as it had done so already for a number of years. Unknown to us though, the core logic had some pitfalls which were exposed by the higher levels of traffic, and we fell straight into them! It’s important as a developer to be honest about the code you build and to learn from all situations, good or bad.
Problems with Popularity
To give an example of one of these pitfalls, availability was suddenly a big factor. Each bookable time slot and seating block had a set number of tickets available. This was due to the need for social distancing and being unable to add more seats. The availability logic under regular traffic levels for the site would have been able to cope with this. Under the old logic, the ticket selections available were updated after another booking had been made. This is done automatically at the point of payment in a similar way to how stock for a physical product would be done.
When we went live and bookings were being accepted we saw tickets being purchased at the same time by multiple customers. This was a side effect of increased traffic levels, the fact that all bookings were now being made through the website, and the additional effect of everybody being sick of being stuck in the house for most of the year. If you have 2 tickets available for a certain selection and 5 people all have that ticket in their basket at the same time, then you can start to see that we have a problem on our hands.
The checkout would succeed for all of these people within seconds of each other. Normally we would have time on our side for the system to update the availability of the tickets before the next purchase attempt was made. This would then prevent the ticket from being overbooked, but this was being bypassed by the sheer volume of traffic. They were selling at the same time which meant overbooking was an issue. I’ve used this scenario to demonstrate what we were seeing, now rather than tell you how we resolved this, I’d like to take you through the impact this had on the site as a whole instead.
Impacts on Performance
On this particular site, we have built custom database tables to hold information such as the booking, the customer, the line items, the products, and a number of the more complex product attributes and timetabling. There are a few different combinations where some tables are used and others aren’t. Joining the queries would’ve meant we had missing data where an entry doesn’t exist in one of the secondary tables. As a result, instead of joining our queries, we’ve got small queries inside loops. An example of this in action would be if we wanted a query to return all orders placed this week. We are looping through each row returned and running additional queries for the related data of the customer, product, etc.
This worked perfectly well at the beginning but due to the sheer number of bookings taking place through the website (nearly 24 times the amount compared to the same month last year) we started to have some very large datasets to loop through. This made the site become very slow when it was performing this operation. When we monitored the database queries we just saw that they were small and all completing quickly, so we thought there wasn’t an issue here. In reality though, whilst they are small queries, it was the sheer amount of them being run which caused runtime issues. These occurred when the server was processing the loops, conditionals, etc to generate the output of the page.
A Simple Solution
To combat our issue, we used a very simple trick which is to take the query out of the loop. We would now run our query before the loop, without where parameters such as an order id, and these would instead be used as an array key which we could target. By doing this we could get the data within the loop (using array key targeting) without executing another query. This very simple change saw our query count for the page drastically fall. The query outside the loop was larger than the one inside the loop, but when you are comparing 1 query against 15,000 small ones, it makes a huge impact!
Suddenly our pages were loading very quickly again regardless of the size of the dataset. This is important because we were using lots of these types of queries to check against booked tickets to calculate the remaining availability. As a result, this didn’t just speed up the backend pages, but also the booking process for the customers. By making this change we took our CPU usage on the server down from 80% to 20% during peak traffic. This now allows us to offer the customer a downgrade on the server if they wish to save on some of their hosting costs without any detrimental impact on performance.
I want to stress that the code originally in place had been running the website efficiently for a number of years with no indication of these weaknesses. This year has created exceptional circumstances in many areas and this was one of them. The code just didn’t suit the new requirements of the situation we find ourselves in today and a small refactor was needed.
Sometimes additional server resources will be needed to combat a problem, however this should never stop you from investigating further into the cause of that problem. If you can optimise your code to fix the original issue then the temporary server upgrade can be reversed and you will be confident that the issue has been solved. This will reduce the likelihood of future issues from occurring because the offending bug has been dealt with correctly. That being said, sometimes a server might just really need the extra resource to be able to cope with the demands you’re placing on it.