Do everything from home — that’s the “new normal” COVID pushed the world into almost overnight. For a fitness provider like cure.fit, which offered workout classes across 500+ physical centers, this meant going fully digital and offering the same classes online. As a result, our product was now available not only to those who visited the cult.fit & mind.fit centers but to practically anyone, anywhere, with a mobile phone, good internet and, of course, an intent to work out. With more time on their hands and the realisation that health matters more than ever, a large number of new customers came flooding in. So, how did we manage and address this sudden surge in customer traffic?
In this article, we will walk you through the traffic pattern changes we witnessed with the introduction of live online classes, and how we evolved our tech systems to handle them gracefully.
Changes In App Usage
In the pre-COVID world, a typical user’s app usage primarily revolved around transactional use cases such as booking a cult class, ordering a meal, or booking a doctor consultation, meaning the actual consumption of our products happened offline, outside the app. Each user spent only a few minutes on the app on average to do these tasks, and the traffic was distributed evenly throughout the day.
With all classes and consultations now happening online, all the activity that earlier happened outside the app now happens within it. This means each user spends more than 30 minutes on average on the app, and a large number of them do so at the same time — resulting in clustered traffic patterns with peak traffic during popular workout hours in the mornings (7 AM – 9 AM) and evenings.
Backend Traffic Spike
On our backend, we ran into a never-seen-before traffic pattern: at the end of each live session, we saw a 10x spike in load on our servers for the homepage API. On further analysis, we found that once a class ends, most users head to the homepage to book their upcoming classes or to see their reports. As the homepage API fans out to several downstream services, the spike puts pressure on all of them and their data stores, eventually leading to increased response times and errors.
We regularly perform load tests on our servers for 5x traffic, geared towards sales traffic patterns. The main traffic drivers for such events are the app notifications sent to users about the sale. The notifications are sent in batches in a staggered manner to avoid a sudden burst of traffic. Different people then interact with the notifications at different points in time and navigate to the different sections of interest from the sale page.
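The staggered send can be sketched roughly as below. The batch size and inter-batch gap are illustrative values, not cure.fit’s actual numbers:

```typescript
// Sketch: split a user list into fixed-size batches and schedule each
// batch at an increasing offset, so notification-driven traffic ramps
// up gradually instead of arriving as one burst.
function planStaggeredSend(
  userIds: string[],
  batchSize: number,
  gapMs: number
): { sendAtOffsetMs: number; userIds: string[] }[] {
  const batches: { sendAtOffsetMs: number; userIds: string[] }[] = [];
  for (let i = 0; i < userIds.length; i += batchSize) {
    batches.push({
      sendAtOffsetMs: (i / batchSize) * gapMs,
      userIds: userIds.slice(i, i + batchSize),
    });
  }
  return batches;
}

// Example: 10 users in batches of 4, sent 60 seconds apart.
const plan = planStaggeredSend(
  Array.from({ length: 10 }, (_, i) => `user-${i}`),
  4,
  60_000
);
console.log(plan.map((b) => [b.sendAtOffsetMs, b.userIds.length]));
// → [ [ 0, 4 ], [ 60000, 4 ], [ 120000, 2 ] ]
```

A real sender would hand each batch to a scheduler or delayed queue; the point is that load arrives spread over minutes rather than all at once.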
Contrary to this, the live class traffic spike was entirely unexpected and not planned for. Additionally, the traffic was much higher, and the consumption pattern was different, with most users hitting the homepage at the end of a live class at almost the same time.
The following sections will talk about a few of the immediate fixes and the long term architectural changes we made to accommodate the unplanned traffic spike from live classes.
Pre-Scaling Services To Handle The Spike
At cure.fit, we follow a microservices architecture, where each service is optimised to scale gracefully based on its reaction to organic traffic patterns. Last year, our backend infra was moved to Kubernetes which helped us in better utilization of resources, and in bringing up the pods within a minute during those traffic spikes.
However, in this case, the auto-scaling policy didn’t help much, as the spike was so sudden that by the time auto-scaling kicked in, errors had already occurred. To solve this, we set up a scale-up / scale-down pipeline in accordance with the traffic pattern seen during live class peak hours, ensuring we have enough headroom to handle the end-of-class homepage spike.
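A minimal sketch of the idea behind such a pipeline: pick a replica count from a time-of-day schedule so capacity is in place before the spike, instead of waiting for the autoscaler to react. The windows and replica counts below are hypothetical, and the actual pipeline would apply the result via the Kubernetes API:

```typescript
// Sketch: choose a replica count from a time-of-day schedule so the
// pods exist *before* the end-of-class spike hits. All numbers and
// window boundaries here are illustrative, not production values.
interface ScaleWindow {
  startHour: number; // inclusive, 24h clock
  endHour: number;   // exclusive
  replicas: number;
}

const schedule: ScaleWindow[] = [
  { startHour: 6, endHour: 10, replicas: 40 },  // morning class peak
  { startHour: 18, endHour: 22, replicas: 40 }, // evening class peak
];
const baselineReplicas = 8;

function desiredReplicas(hour: number, windows: ScaleWindow[] = schedule): number {
  for (const w of windows) {
    if (hour >= w.startHour && hour < w.endHour) return w.replicas;
  }
  return baselineReplicas;
}

// A cron-style pipeline would run this periodically and apply the
// result, e.g. with `kubectl scale deployment <name> --replicas=<n>`.
console.log(desiredReplicas(7));  // → 40 (pre-scaled for the morning peak)
console.log(desiredReplicas(14)); // → 8  (back to baseline off-peak)
```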
Although we had pre-scaled our services horizontally to handle the traffic spike, most of our backend APIs queried the data store directly, which became the next bottleneck. These are the problems we had to deal with:
- An increase in the number of service pods led to an increase in the number of connections to DB, which in turn, resulted in DB reaching its connection limits.
- With the increase in load, we hit both CPU and disk IOPS limits in our MongoDB and MySQL databases, resulting in an increase in query times and higher error rates.
- As we shared DB servers between some of the microservices, a traffic spike in one service brought down the performance of the other services using the same DB server. Clubbing databases, though not ideal, had reduced the overhead of maintaining and monitoring multiple databases and helped with cost control.
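The connection-limit problem in the first bullet above is easy to see with some back-of-the-envelope arithmetic; the pool size and connection limit below are hypothetical:

```typescript
// Sketch: total DB connections grow linearly with pod count, so
// horizontal pre-scaling alone can push a fixed-size database past its
// connection limit. All numbers are illustrative.
function totalDbConnections(pods: number, poolSizePerPod: number): number {
  return pods * poolSizePerPod;
}

const dbConnectionLimit = 1000; // hypothetical managed-DB limit

// 20 pods with a pool of 20 connections each is comfortably inside the limit...
console.log(totalDbConnections(20, 20)); // → 400
// ...but pre-scaling to 10x the pods blows past it.
console.log(totalDbConnections(200, 20) > dbConnectionLimit); // → true
```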
The immediate fixes we performed to solve the above problems are as listed below:
- As we already had read replicas set up in both MongoDB and MySQL, we cautiously moved pure read calls to the read replicas wherever possible.
- The databases were vertically scaled up. Since we used managed solutions—AWS RDS (MySQL) and Atlas (MongoDB)—scaling them up was easy and helped us react to the situation quickly.
- We gave every tier-1 service its own dedicated DB server, so that failures could be isolated without impacting other services.
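The read-replica fix in the first bullet amounts to routing pure reads to a replica endpoint while everything else stays on the primary. A minimal sketch with stubbed clients (the endpoint names and `DbClient` interface are stand-ins, not our actual driver setup):

```typescript
// Sketch: route pure reads to a replica, everything else to the primary.
// DbClient is a stand-in; a real setup would wrap the MySQL/MongoDB
// driver's replica-aware connection handling.
interface DbClient {
  endpoint: string;
  query(sql: string): string; // stub: returns which endpoint served it
}

function makeClient(endpoint: string): DbClient {
  return { endpoint, query: (sql) => `${endpoint} ran: ${sql}` };
}

const primary = makeClient("mysql-primary:3306");
const replica = makeClient("mysql-replica:3306");

// Only queries explicitly marked as pure reads go to the replica --
// replicas lag, so read-your-own-write flows must stay on the primary,
// which is why the move had to be done cautiously.
function run(sql: string, opts: { pureRead: boolean }): string {
  const client = opts.pureRead ? replica : primary;
  return client.query(sql);
}

console.log(run("SELECT * FROM classes", { pureRead: true }));
// → mysql-replica:3306 ran: SELECT * FROM classes
console.log(run("UPDATE bookings SET status = 'done'", { pureRead: false }));
// → mysql-primary:3306 ran: UPDATE bookings SET status = 'done'
```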
All these changes were made quickly as immediate fixes, which helped us absorb the traffic spike and handle it gracefully, at the cost of increased infra expenses. We then decided to start with a clean slate—take a fresh look at our homepage architecture and tech choices with the goal of reducing infrastructure cost and scaling further without issues. The upcoming sections cover the changes we made to achieve this.
Re-Architecting The Homepage
Our homepage (shown below) serves as a multi-purpose source that shows varied data from different product lines, includes widgets for managing the upcoming activities of the users (cult workouts, meal planning, live classes, etc.), and offers personalized recommendations for meals and live classes.
All the data that goes into the homepage is fetched from different microservices. Hence, we followed a scatter-gather approach, as depicted in the following diagram.
As the above diagram shows, for every homepage call, the API Gateway issues a call to every backend service to collect the required data, which is then sent back to the app. When our load increased 10x for live classes, it proportionately increased the load on all downstream services called from the homepage, even those unrelated to live classes. Due to this, we had to pre-scale every downstream service behind the homepage for 10x load, even though the actual spike in usage came only from live classes.
To optimise this, we moved to the below-depicted push and pull approach, where we segment users based on their usage patterns and only call a downstream system if it is relevant to that user. This eliminated the unnecessary fan-out to all downstream services that happened in the previous approach.
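A rough sketch of the segmentation idea: derive the set of relevant downstream services from the user’s usage profile and fan out only to those. The segment fields and service names below are illustrative, not our actual service registry:

```typescript
// Sketch: instead of fanning out to every service on each homepage
// call, derive the relevant services from the user's usage profile and
// call only those. Profile fields and service names are illustrative.
type Service = "live" | "eat" | "care" | "cult";

interface UsageProfile {
  bookedLiveClass: boolean;
  ordersMeals: boolean;
  hasCareSubscription: boolean;
}

function relevantServices(profile: UsageProfile): Service[] {
  const services: Service[] = ["cult"]; // always shown on the homepage
  if (profile.bookedLiveClass) services.push("live");
  if (profile.ordersMeals) services.push("eat");
  if (profile.hasCareSubscription) services.push("care");
  return services;
}

async function buildHomepage(
  profile: UsageProfile,
  fetchers: Record<Service, () => Promise<object>>
): Promise<object[]> {
  // Fan out in parallel, but only to the relevant services.
  return Promise.all(relevantServices(profile).map((s) => fetchers[s]()));
}

// A live-only user now triggers 2 downstream calls instead of 4.
console.log(
  relevantServices({ bookedLiveClass: true, ordersMeals: false, hasCareSubscription: false })
); // → [ 'cult', 'live' ]
```

With this shape, a 10x spike from live users only multiplies traffic to the live service, not to every service behind the homepage.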
The architecture shown above ensured that services like the eat and care services received traffic from the homepage in line with their usage patterns. However, we still had to solve the live service usage pattern, which was putting a lot of pressure on its data store.
We began by looking for reads that could be served from a cache instead of hitting the primary DB. We picked Redis for our caching requirements, as it was very fast, scaled well, and we had in-house expertise with it. At the same time, we had to be cautious about the number of use cases we cached in Redis, both because of the overhead of maintaining consistency and the extra cost of keeping data in memory.
Following are some examples where we modelled TTLs in line with the product life cycle, so as not to bloat the Redis datastore while still benefiting from faster Redis reads:
- For the live class schedule, we cached the schedule only up to 7 days ahead, as that is the visibility window on the app. A nightly job refreshes the data in Redis with a rolling window.
- For live class bookings made by the user, we kept the booking information in Redis with a TTL of class end time + 2 days, since only upcoming classes are shown on the homepage; past completed classes are hidden one level deeper, where the data can be pulled from the DB.
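The two TTL rules above can be expressed as small functions. The 7-day window and the +2 days come from the rules as described; expressing TTLs in seconds is an assumption that matches how Redis `EXPIRE` works:

```typescript
// Sketch of the two TTL rules described above. Times are Unix epoch
// seconds; Redis TTLs via EXPIRE are also in seconds.
const DAY = 24 * 60 * 60;

// Rule 1: schedule entries live only as long as the 7-day visibility
// window on the app; a nightly job re-populates the rolling window.
function scheduleTtlSeconds(): number {
  return 7 * DAY;
}

// Rule 2: a booking is kept until class end + 2 days, since only
// upcoming classes are rendered on the homepage. Past classes fall out
// of the cache and are served from the DB one level deeper.
function bookingTtlSeconds(classEndEpoch: number, nowEpoch: number): number {
  return Math.max(0, classEndEpoch + 2 * DAY - nowEpoch);
}

// Example: a class ending in 1 hour stays cached for ~2 days + 1 hour.
const nowEpoch = 1_700_000_000;
console.log(bookingTtlSeconds(nowEpoch + 3600, nowEpoch)); // → 176400
```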
Likewise, we looked into other high-throughput read flows and added caching with the application nuances in mind. This helped us both reduce DB infra costs and achieve better API latencies.
Rewriting High Throughput APIs In Java
Post-optimizations, we were able to reduce the cost of the DB infra and all other downstream services. However, we still needed a large number of pods in the API Gateway to serve the spiky traffic coming from the app. Hence, we did a performance analysis to better understand where time was being spent in the Node VM.
The performance analysis with flame graphs showed that most CPU cycles were spent on JSON serialization and deserialization. Though we reduced the JSON payload considerably, the Node VM’s performance still degraded as RPS increased, which led us to run the following benchmarks comparing Node and Java performance.
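The kind of hot spot the flame graph surfaced can be illustrated with a toy measurement. This is a sketch with a made-up payload shape, not the actual profiling setup:

```typescript
// Sketch: measure the CPU time repeated JSON serialize/deserialize of a
// homepage-sized payload consumes. The payload shape is invented for
// illustration; it is not the real homepage response.
function makePayload(widgets: number): object {
  return {
    widgets: Array.from({ length: widgets }, (_, i) => ({
      id: `widget-${i}`,
      title: `Section ${i}`,
      items: Array.from({ length: 10 }, (_, j) => ({ id: j, label: `item-${j}` })),
    })),
  };
}

function jsonRoundTripMs(payload: object, iterations: number): number {
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) {
    JSON.parse(JSON.stringify(payload)); // serialize + deserialize
  }
  return Number(process.hrtime.bigint() - start) / 1e6;
}

// This per-request cost is paid on every call, so at a few thousand RPS
// it alone consumes a large share of each core.
const samplePayload = makePayload(50);
console.log(`1000 round trips: ${jsonRoundTripMs(samplePayload, 1000).toFixed(1)} ms`);
```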
Test 1: API that looks up user details from MySQL for a given user id and returns the JSON response.
| Stack | Number of cores | Average latency | p90 latency | RPS |
| --- | --- | --- | --- | --- |
| Node | 20 | 35 ms | 56 ms | 1500 |
| Java | 6 | 6 ms | 8 ms | 1500 |
| Node | 20 | 39 ms | 59 ms | 2000 |
| Java | 7 | 7 ms | 10 ms | 2000 |
| Node | 50 | 92 ms | 180 ms | 4000 |
| Java | 30 | 20 ms | 37 ms | 4000 |
Test 2: API that looks up user details from Redis for a given user id and returns the JSON response.
| Stack | Number of cores | Average latency | p90 latency | RPS |
| --- | --- | --- | --- | --- |
| Node | 20 | 37 ms | 56 ms | 2000 |
| Java | 10 | 6 ms | 9 ms | 2000 |
| Node | 42 | 75 ms | 150 ms | 4000 |
| Java | 20 | 13 ms | 29 ms | 4000 |
| Node | 89 | 48 ms | 269 ms | 8000 |
| Java | 40 | 5 ms | 19 ms | 8000 |
| Node | 280 | 30 ms | 80 ms | 13000 |
| Java | 78 | 3 ms | 20 ms | 13000 |
Based on the above test results, it was evident that Java could achieve much better latency numbers with far fewer cores.
Additionally, for debugging performance issues and finding optimisations, we found far more tools and articles for Java than for Node.js, which can be attributed to the maturity of the JVM and the Java community over the years.
We originally chose Node.js (Express) as our primary API backend framework for its popularity and ease of development. It also helped us iterate and move fast, especially in the early days, when we could share many libraries, utils and other code between the UI and backend. However, as the benchmarks above show, it became a bottleneck beyond a certain scale. Hence, we rewrote a couple of high-throughput APIs in Java, which was already the de facto choice for our high-throughput backend systems.
With all the above optimizations in place and the APIs rewritten in Java, we saw the following latency improvements for two of our high-throughput APIs, while running 1/7th the number of API Gateway instances required before these fixes.
| API | Before fix p50 | After fix p50 | Before fix p90 | After fix p90 |
| --- | --- | --- | --- | --- |
| App launch | 110 ms | 18 ms | 140 ms | 30 ms |
As every organization matures and scales, rewrites become a necessity, and we had to do ours in a hurry due to the unexpected situation we found ourselves in. We are now building a culture of keeping performance in mind whenever we build a new API. Additionally, we have added performance automation alongside functional automation in our CI/CD pipeline to ensure APIs adhere to our performance standards. Read more about it in this blog.
On a final note, here are the key learnings we gained through this entire process.
We are just getting started on the scaling journey. We will have to solve for even higher scale as our digital product user base grows over the coming years.