Being a software engineer can have its bittersweet moments. Sometimes you write code and everything works the first time; other times you spend days debugging an otherwise ‘small’ piece of code only to figure out the problem all along was an invisible leading or trailing whitespace. In some cases you simply restart your system and BOOM, the code is working! Talk about village people, yeah? It goes without saying that your approach to problem-solving, coupled with your patience and agility, is the ladder that gets you over that Trump wall.
The real software engineers, especially the ones who love to play around in the backend, know that the true test of your application is in the production environment, after you have deployed the application for your customers or end-users. Unfortunately, this is also the environment where the most interesting problems occur. I am sure you have noticed (assuming you are a real engineer) that no amount of planning or testing can guarantee 100% freedom from ‘issues’ in production. Production is to test what Thanos is to Hulk, and certain issues require an assembly of Avengers to get through.
Certain production problems are sent from the pit of hell, and we experienced this from the front row while recently delivering on an implementation for one of the top telcos in Nigeria. There was chaos; there was confusion; there was drama; there were prayers; there were sleepless nights and tired faces; there were tons of intense brainstorming and development sessions to stir up new ideas; there were experiments and server restarts. Simply put, the entire scenario had the makings of a horror movie. We had everything but solutions.
“We cannot solve our problems with the same thinking we used when we created them.” — Albert Einstein.
Our problems were aplenty. Following our migration to a new set of infrastructure, we noticed the resources (CPU and RAM) on our Windows servers were spiking frequently with no obvious increase in traffic from the end-users. Each spike in resource utilization was typically followed by the servers grinding our applications to a halt. It did not stop there. The database which houses all the subscriber data was acting up and experiencing a lot of locks. This is the human equivalent of stuffing your mouth with an orange of equal size and trying to drink water at the same time. We also observed that certain servers were receiving more traffic than the rest. Other, more stubborn servers simply refused to start up.
I wish that was all. The longer the problems stayed unresolved, the bigger the hit our reputation took. The customer was furious, to say the least, and the onslaught of escalations from the field did anything but douse their frustration. How do you explain to your customer that users cannot log into the application you built for them and you do not know why? How do you explain that you had one job, to ensure that customers were onboarded with ease, and you somehow managed to mess it up even without trying? I still stutter at the thought of it.
Our first real step towards solving the problems decisively was coming together as a team, without privilege or assumptions, to define the problem and identify its symptoms. We brainstormed around each problem and identified the possible options for resolution.
The first item we opted to resolve was the spikes in resources, especially during peak periods (9 am-5 pm). The first logical step was to ensure that the components which receive the traffic from the client were optimally tuned for high traffic load. The touchpoints here were the software load balancer, which distributes the requests from the client, and the application server, which houses the applications servicing the requests. The former was fine, but the application server (called Wildfly) was still running on the default configurations which come out-of-the-box.
We had assumed all along that the tuning we usually apply to our app servers was in place. We later discovered that it had been omitted during the initial deployment of the app server some months earlier due to a misconfiguration in the deployment pipeline. We tuned the application server by increasing the number of input/output (IO) threads available for processing requests on the server and also increasing the number of requests the application server would accept for processing.
Finally, we increased the minimum and maximum heap sizes of the JVM (the heap is the portion of RAM the JVM sets aside for your application’s objects), setting both to the same value in what we referred to as GC adaptive sizing. Strictly speaking, equal minimum and maximum values pin the heap at a fixed size rather than letting the JVM adapt it. Our expectation was that this configuration would keep the garbage collector (GC) running frequently enough to free up more memory for the application to do more work.
The issue persisted! The spikes continued as CPU utilization consistently hit 100%. Our first thought was that this was purely a CPU-related issue and had nothing to do with RAM. However, on combing through the server logs, we discovered a series of GC-related errors indicating that GC had exceeded its overhead limit. We learned that this error is thrown when the JVM has spent a lot of time on garbage collection without reclaiming significant memory space.
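To give a feel for how such errors can be surfaced from a large log file, here is a minimal Python sketch. It assumes the message in question was the HotSpot JVM's standard `java.lang.OutOfMemoryError: GC overhead limit exceeded` and that every log line begins with a `YYYY-MM-DD HH:MM:SS` timestamp; the sample lines are made up for illustration and are not our actual logs.

```python
from collections import Counter

# Assumed marker: the standard HotSpot message for this condition.
GC_OVERHEAD_MARKER = "java.lang.OutOfMemoryError: GC overhead limit exceeded"


def count_gc_overhead_errors(log_lines):
    """Count GC overhead errors per hour.

    Assumes each line starts with a 'YYYY-MM-DD HH:MM:SS' timestamp,
    so the first 13 characters identify the hour.
    """
    per_hour = Counter()
    for line in log_lines:
        if GC_OVERHEAD_MARKER in line:
            per_hour[line[:13]] += 1
    return per_hour


# Illustrative sample only (hypothetical log content).
sample_log = [
    "2021-03-01 09:15:02 ERROR java.lang.OutOfMemoryError: GC overhead limit exceeded",
    "2021-03-01 09:47:30 ERROR java.lang.OutOfMemoryError: GC overhead limit exceeded",
    "2021-03-01 10:05:11 INFO  Request processed",
    "2021-03-01 10:20:45 ERROR java.lang.OutOfMemoryError: GC overhead limit exceeded",
]

errors_per_hour = count_gc_overhead_errors(sample_log)
for hour, count in sorted(errors_per_hour.items()):
    print(hour, count)
```

Grouping by hour makes it easier to line the errors up against the CPU spikes seen during peak periods.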
This led us to believe that either we had a memory leak or the adaptive sizing configuration on the JVM was causing more harm than good. The latter would mean that GC was occurring too frequently and contributing heavily to the CPU spikes. We proceeded to reconfigure the JVM heap sizes, setting a low value (2GB) for the minimum heap size and scaling the maximum heap size to a relatively high value (8GB). This new configuration brought some respite to the application server as the spikes drastically reduced, with CPU utilization shuttling between 60% and 85%. We then applied the recommended GC optimization settings we had used in our setup for a similar client.
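For readers who have not set these values before, the heap bounds map onto the standard JVM flags `-Xms` (initial/minimum heap) and `-Xmx` (maximum heap). The small Python helper below is only an illustrative sketch of how the 2GB and 8GB values translate into flags; the function name is hypothetical, and in a standalone Wildfly setup such flags are typically supplied through `JAVA_OPTS` in the server's startup configuration.

```python
def heap_flags(min_heap_gb, max_heap_gb):
    """Build the JVM heap flags from sizes given in whole gigabytes.

    -Xms sets the initial (minimum) heap size and -Xmx the maximum.
    Passing the same value for both pins the heap at a fixed size.
    """
    if min_heap_gb <= 0 or max_heap_gb < min_heap_gb:
        raise ValueError("need 0 < min_heap_gb <= max_heap_gb")
    return [f"-Xms{min_heap_gb}g", f"-Xmx{max_heap_gb}g"]


# The values we settled on: 2GB minimum, 8GB maximum.
tuned_flags = heap_flags(2, 8)
print(" ".join(tuned_flags))  # -Xms2g -Xmx8g
```

With a low minimum and a high maximum, the JVM starts small and is free to grow the heap under load instead of being held at one size.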
Not long after we celebrated this small win, the CPU spikes started again. At this point, our line of thought was to look ‘inward’ towards a memory leak in one of our applications. The application server had 19 applications deployed on it, and any one of them could be the culprit. We removed all applications from the server and monitored the performance. Next, we proceeded to add the applications back one by one, each time noting the effect of the deployment on the system’s CPU.
This way we were able to narrow down to the offending application, which was the application responsible for processing subscriber data. We decided to deploy that application to a dedicated application server and to scale it across multiple instances of the application server. The results were remarkable, as the stability this time around was sustained over a longer period of time.
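The one-by-one procedure can be sketched in a few lines of Python. This is a simulation under stated assumptions, not the tooling we used: `measure_cpu` is a hypothetical callback standing in for "deploy this set of applications and read the CPU utilization", and the application names and CPU numbers are invented for illustration.

```python
def find_offending_apps(apps, measure_cpu, threshold):
    """Redeploy apps one at a time and flag the ones that spike CPU.

    measure_cpu(deployed) is a hypothetical callback returning the CPU
    utilization (in percent) observed with the given apps deployed.
    An app is flagged when adding it raises CPU by more than `threshold`.
    """
    deployed = []
    offenders = []
    previous_cpu = measure_cpu(deployed)  # baseline with nothing deployed
    for app in apps:
        deployed.append(app)
        cpu = measure_cpu(deployed)
        if cpu - previous_cpu > threshold:
            offenders.append(app)
        previous_cpu = cpu
    return offenders


# Simulated environment: 19 apps, one of which is far more expensive.
def simulated_cpu(deployed):
    return sum(60 if app == "subscriber-data" else 2 for app in deployed)


apps = [f"app-{i}" for i in range(1, 10)] + ["subscriber-data"] + [
    f"app-{i}" for i in range(10, 19)
]
offenders = find_offending_apps(apps, simulated_cpu, threshold=10)
print(offenders)  # ['subscriber-data']
```

In practice each measurement needs to run long enough, and under enough load, for a leak or a hot code path to show itself, which is why this step takes patience.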
While we were investigating the application layer, we were also investigating the performance and efficiency of the database in parallel. A deep dive into our application server logs revealed a significant number of database connection issues which had persisted for days, fueling the need for database-related optimizations. The database administrator advised that the ideal number of connections for the enterprise database in use was 700.
Considering that we had 27 instances of the application server, each configured with a minimum of 20 and a maximum of 100 connections, the pools could collectively request up to 2,700 connections (27 × 100), so our configurations needed to be revised downwards so as not to exceed the optimal 700. We allocated a buffer of 50 connections (out of the 700) to reports and queries, then split the remaining 650 connections among the 27 instances (minimum of 10; maximum of 24 connections each, since 650 ÷ 27 ≈ 24). This configuration helps ensure we do not push the database beyond its optimal capacity going forward. Subsequently, we applied the configuration across all instances.
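The arithmetic behind the new pool sizes is simple enough to check in a few lines of Python. The figures come straight from the paragraph above; the function and variable names are just illustrative.

```python
def max_connections_per_instance(db_limit, reserved, instances):
    """Split the usable connections evenly across instances, rounding down
    so the total can never exceed the database limit."""
    usable = db_limit - reserved
    return usable // instances


INSTANCES = 27
DB_LIMIT = 700   # optimal connection count advised by the DBA
RESERVED = 50    # buffer kept aside for reports and queries

old_worst_case = INSTANCES * 100                                   # old max pool size
new_max = max_connections_per_instance(DB_LIMIT, RESERVED, INSTANCES)
new_worst_case = INSTANCES * new_max

print(old_worst_case)   # 2700
print(new_max)          # 24
print(new_worst_case)   # 648
```

Rounding down leaves two of the 650 connections unused, which is a fair price for the guarantee that all 27 pools at their maximum, plus the reporting buffer, stay within the 700 limit.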
We had succeeded in stabilizing the environment by tuning the application servers for high load conditions and optimal garbage collection. We had also tuned the database settings in our application server configuration file so that the database performed optimally, stayed up, and would never become a bottleneck. Next, we turned our focus to fixing a lingering validation service error which had become a thorn in the customer’s flesh.
The validation service is an application which embeds several machine learning models used to perform biometric quality checks. It is a compute-intensive engine which works with a collection of functions packaged as dynamic-link libraries (DLLs) and as such is accessed by a Windows program. The service was frustratingly erratic and barely starting up. The application was deployed in a standalone, server-embedded mode and the logs were barely useful. We applied all the performance tuning done on the other application servers, with very little success. We deployed more instances of the application in a bid to address possible load issues on the server. That too proved abortive.
We took a heap dump of the Java process and analyzed it using JDK’s JConsole tool. Still nothing. We took a step back to retrace every change that had occurred on the application between the old and new infrastructure. This thought process highlighted that the Windows OS version in the new environment (Windows Server 2019) was different from that of the old environment (Windows Server 2012). We had carried out our development and load tests on Server 2012.
At this point, we resolved to move the application back to an available Windows Server 2012 box. This worked!! We instantly created more instances on the box and internally logged the Windows Server 2019 issue as technical debt to be treated in the near future.
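Retracing changes between two environments is easier when each environment is written down as a flat set of attributes and compared mechanically. The Python sketch below illustrates the idea; only the two OS versions come from our actual setup, while the other keys and the function name are hypothetical placeholders.

```python
def diff_environments(old, new):
    """Return {attribute: (old_value, new_value)} for every attribute
    that differs between two environment descriptions."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Only the OS versions are real; the remaining keys are illustrative.
old_env = {
    "os": "Windows Server 2012",
    "deployment_mode": "standalone",
    "app_server": "Wildfly",
}
new_env = {
    "os": "Windows Server 2019",
    "deployment_mode": "standalone",
    "app_server": "Wildfly",
}

changes = diff_environments(old_env, new_env)
print(changes)  # {'os': ('Windows Server 2012', 'Windows Server 2019')}
```

Anything that shows up in the diff and was never covered by testing in its new form is a prime suspect.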
Our CEO, Chimezie Emewulu, said it best: “There is no big red button that solves all issues but a series of micro-optimizations that work together to achieve a bigger whole.” I couldn’t agree more.
In retrospect, there is an overwhelming satisfaction that comes after such periods of daunting problem-solving, the satisfaction of an engineer living up to the art. Of course, we would rather not have the problem occur in the first place, as there is even more satisfaction in keeping your customers happy at all times.
However, situations such as ours help emphasize the need to foster a learning environment, especially in fast-moving deliveries like the ones we have in the telco space. And yes, the team proceeded to have a comprehensive retrospective where we scrutinized every touchpoint in our implementation process and documented all learnings to ensure that such situations do not happen again.