When Curefit started 4 years ago, we had a high-level idea of what kind of problems we would have to solve. And while the product and business teams were figuring out the finer details, the first question we (the tech team) had to answer was “What should be our tech stack?”
Take the above question lightly and you might end up rewriting big parts of your systems in just a year. If you get fancy with it, you face problems that very few people in the world can help you with. If you get lazy with it, you will spend the majority of your time solving problems that have long been solved.
Considering this is a problem most of us face in the early days, we felt that sharing our learnings and thought process would help folks choose the right tech stack.
Tenets → Tech stack philosophy → Tech Stack
You can’t solve a problem unless you know what you are solving for. Answering what you want from the stack is important and makes choices easier. Depending on the problem space you are working in, factors such as the speed of execution needed and the size of the tech budget directly affect what you need to solve for. Here are the tenets we came up with for our use cases.
Tenets required from tech stack in Curefit
- It should be easy to pick up new projects or different codebases without a steep learning curve. People shift projects and teams often, and we encourage that.
- It should make hiring easy and not require us to interview on a particular language or infrastructure. It should not lead to months of training or onboarding.
- Go for depth rather than breadth. It is impossible to develop depth across multiple languages/frameworks/tools. When things break (and they will definitely break) or you get stuck, you need multiple experts in house to solve and move ahead quickly.
- Stay away from buzzwords. The tech field is filled with buzzwords which change every few months and years. Just choosing something because “it’s cool” or because “another team in your past company used that” or “because my flatmate raves about it” are the easiest traps to fall for. Research and choose the tech that solves for all the tenets.
- Tech stack should not get in the way or make progress slow/costly. At the same time, it should not encourage writing bad code/hacks. It should not make reading someone else’s code impossible.
- Most tech effort should go into your company’s core product or competency. If the majority of your tech bandwidth is going into building something that you are not selling to your customers or making their experience better or safe, something is wrong.
Our tech stack philosophy
- Microservice architecture – This is a no-brainer today for 99% of startups. The benefits have been widely known for a long time now. The only thing to keep in mind is not to overdo it. E.g. “A service should just do one thing” should not translate to one API per service. Curefit is a big adopter of microservice architecture and has 100+ live services at the moment.
- Build, buy or embrace open source – This is something that will vary the most across different startups and depends hugely on the size of the team, future aspirations and capital available. At Curefit we build something in-house only if one of the following holds true:
- It is a core problem for us or a part of what we are providing/selling to customers: E.g. a code build tool is critical for us to function but is not part of our customer offering, so there is no point building it (just use Jenkins / code build etc). On the other hand, delivery boy routing and allocation is core to the eat.fit business PnL and directly impacts the customer experience, so we built it ourselves.
- No other existing tool quite fits our requirement/philosophy: There are many tools and SaaS products built by very talented folks throughout the world. But quite often they are not configurable, or are very hard to configure/learn/integrate. Building something yourself in such cases might be better because an in-house tool does exactly what you need and stays lean and flexible.
- Existing tools are cost-prohibitive
- Two is better than one and three – A typical microservice has a server layer, a cache layer and a DB, which make up the bulk of it. A developer needs to choose each of these depending on the use case. Unless the requirements are very specific (e.g. super-strong consistency, a very high write-to-read ratio, hyper compute requirements, etc), most well-established tech stacks can handle them. Because as a startup we wanted to go for “depth rather than breadth”, we would ideally have chosen one technology for each part of the stack across all services, but we understood that would become too restrictive. On the other hand, three or more choices would have led to a proliferation of technologies that we as a company would have to understand, maintain and build long-term expertise in.
Hence we decided that we should have 2 default options for all parts of our stack. And till this day we don’t ask a single question to our developers as to why they chose one option over the other in their service.
We also understand that a cookie-cutter approach like the above might not work for 100% of our services, since requirements differ. That’s why we are completely open to adopting new technologies, but the developers working on the problem in question have to make a strong technical case for why our default options are a bad choice in that case.
Curefit’s tech stack
- Microservice stack: We have been huge proponents of microservice architecture from day one and currently have 100+ microservices across all our products and services. The services are all deployed on a Kubernetes cluster and talk to each other over HTTP. Services built on either of the stacks below use a set of packages/libraries that help ensure they share a common baseline.
- The MERN (MongoDB, Express, ReactJS, and Node.js) stack is a popular choice amongst our teams. One of the main advantages of using this stack is that it makes client-side code much easier to write and reduces context switching between multiple languages. The services also connect to other databases depending on the requirement. For example, we use a relational database like MySQL for our inventory service. It’s also extremely easy to build asynchronous flows in TypeScript.
- The other stack is built on Java and Spring Boot for the backend. A lot of our highest-throughput APIs are built on this stack, and it is preferred primarily for backend-only services.
- DB stack: We currently have centralized database instances for the whole company instead of having instances at a team/product level. This choice has its own pros and cons (maybe a topic for another blog). Our services primarily use MongoDB and MySQL to store data.
- MongoDB, being a NoSQL store, helps us design and evolve our schemas with relative ease and flexibility. We use MongoDB Atlas to host our MongoDB servers. Atlas makes it very easy to make configuration changes like scaling up, adding more storage, or creating read-only or analytics nodes. It also has a performance advisor which provides index suggestions with easy-to-consume statistics. The dashboard, with lots of relevant metrics, helps guide us towards the root cause of any issue that crops up. We use both Replica Set clusters and Sharded clusters, depending on the use case.
- MySQL is one of the most popular, free, open-source relational databases. We chose Amazon RDS as it reduces our administrative effort around hardware provisioning, resizing capacity, database setup, monitoring, patching and backups. We can create read replicas quite easily to scale our read-only flows and reduce the load on our write nodes.
- Data Warehouse: Most companies today implement a data warehouse backed by columnar storage. Data from all of our data sources is pushed to the warehouse through a third-party ETL tool, and the warehouse drives a lot of our reports and dashboards. We chose Amazon Redshift as our data warehouse. Redshift offers automated provisioning, backups, integrations with a range of third-party tools, and multiple ways to connect to and query the data concurrently.
- Caching: Caching can help reduce latency and increase the throughput of your system manifold. We use a combination of Redis and in-process (inside the application memory) caching.
Redis provides an easy-to-use cache to start with, but it also supports complex atomic operations like set addition that help set up a more mature cache. Some of the use cases where we find Redis useful:
- Segmentation: Every time a user lands on the app, we identify the attributes of the user. Is the user a cult member? Has the user tried out eat.fit meals? These attributes help us provide a more personalised experience to the user. They are needed on almost every app call, so we store them pre-computed in Redis.
- Catalogue: Anything that can be purchased on our app must reside in the catalogue, and the catalogue is accessed to power most user interactions with the app. Again, Redis helps us attain high throughput for this. We can also divide our catalogue into sets, one for each vertical, so that it remains segregated.
Both of the above use cases require that the data present in Redis be consistent. Hence, we use a write-through mechanism to keep it in sync with the primary store.
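A minimal sketch of the write-through idea follows. It is an illustration under stated assumptions, not the production implementation: plain in-memory Maps stand in for Redis and the primary database, the code is synchronous (a real Redis client would be asynchronous), and the names `UserSegment` and `WriteThroughStore` are hypothetical.

```typescript
// Write-through sketch. Plain Maps stand in for Redis and the primary
// database; a real Redis client would be asynchronous. All names here
// (UserSegment, WriteThroughStore) are hypothetical.

interface UserSegment {
  isCultMember: boolean;
  hasTriedEatFit: boolean;
}

class WriteThroughStore<V> {
  private readonly primary = new Map<string, V>(); // stand-in for the database
  private readonly cache = new Map<string, V>(); // stand-in for Redis

  // Write to the primary store first, then to the cache, so a
  // completed write leaves both holding the same value.
  write(key: string, value: V): void {
    this.primary.set(key, value);
    this.cache.set(key, value);
  }

  // Serve reads from the cache; on a miss, fall back to the primary
  // store and repopulate the cache.
  read(key: string): V | undefined {
    const cached = this.cache.get(key);
    if (cached !== undefined) return cached;
    const stored = this.primary.get(key);
    if (stored !== undefined) this.cache.set(key, stored);
    return stored;
  }

  // Exposed only so the example can show that the cache was updated.
  isCached(key: string): boolean {
    return this.cache.has(key);
  }
}

const segments = new WriteThroughStore<UserSegment>();
segments.write("user:42", { isCultMember: true, hasTriedEatFit: false });
const segment = segments.read("user:42"); // served from the cache
```

Because every write updates both stores in the same call, readers do not have to wait for an expiry before they see a changed attribute.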
We also use Redis to optimise most operations that are IO intensive or computationally intensive. For example, the list of active offers for a vertical is stored in Redis with a TTL. This ensures that most of the calls hit Redis and avoid a DB call. Once the TTL expires, the next call goes to the DB, fetches the latest data and updates the cache.
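The TTL flow described above can be sketched as follows. This is a simplified, in-memory illustration: `TtlCache`, `loadActiveOffers` and the injectable clock are hypothetical names, the Map stands in for Redis, and the loader stands in for the DB query.

```typescript
// Cache with TTL sketch. The Map stands in for Redis and `load` for the
// database query. TtlCache, CacheEntry and loadActiveOffers are
// hypothetical names used for illustration.

type Clock = () => number; // current time in milliseconds

interface CacheEntry<V> {
  value: V;
  expiresAt: number;
}

class TtlCache<V> {
  private readonly entries = new Map<string, CacheEntry<V>>();
  private readonly ttlMs: number;
  private readonly now: Clock;

  constructor(ttlMs: number, now: Clock = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Return the cached value while it is fresh; otherwise call `load`,
  // store the result with a new expiry, and return it.
  getOrLoad(key: string, load: () => V): V {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) {
      return hit.value;
    }
    const value = load();
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }
}

// Usage with a fake clock so that expiry can be shown without waiting.
let dbCalls = 0;
const loadActiveOffers = (): string[] => {
  dbCalls += 1; // counts how often the "database" is hit
  return ["WELCOME10"];
};
let fakeNow = 0;
const offerCache = new TtlCache<string[]>(60000, () => fakeNow);

offerCache.getOrLoad("offers:eat.fit", loadActiveOffers); // miss: loader runs
offerCache.getOrLoad("offers:eat.fit", loadActiveOffers); // hit: loader skipped
fakeNow = 60000; // the TTL has elapsed
offerCache.getOrLoad("offers:eat.fit", loadActiveOffers); // expired: loader runs again
```

The trade-off compared with write-through is that a change in the DB may stay invisible for up to one TTL, which is acceptable for data such as offer lists.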
In-process cache: Because it shares memory with the application, this cache avoids the overhead of connection pools and network calls. Application flags, configurations and small data tables are the most common things we keep in in-memory LRU caches.
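For readers unfamiliar with LRU eviction, here is a minimal sketch. It assumes nothing about the library actually used in the services; `LruCache` is a hypothetical class that relies on the fact that a JavaScript `Map` iterates in insertion order.

```typescript
// LRU cache sketch built on Map, which iterates in insertion order.
// LruCache is a hypothetical class, not a specific library's API.

class LruCache<K, V> {
  private readonly entries = new Map<K, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    if (capacity < 1) throw new Error("capacity must be at least 1");
    this.capacity = capacity;
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert so that this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }

  get size(): number {
    return this.entries.size;
  }
}

const flags = new LruCache<string, boolean>(2);
flags.set("newCheckout", true);
flags.set("darkMode", false);
flags.get("newCheckout"); // touch: now the most recently used
flags.set("liveClasses", true); // evicts "darkMode", the least recently used
```

Both operations run in constant time, and the fixed capacity bounds how much application memory the cache can take.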
- AWS services: Using hosted services for development gets you off the blocks quickly. Moreover, using multiple services from the same provider comes with the benefit of strong integration between them. We use AWS as the preferred solution for our hosting needs as it lets our engineers work mostly on health-related problems rather than infrastructure-related problems. Here are the services we use the most:
- Relational Database Service: Hosted relational databases. We use multiple RDS instances and clusters, enabled with monitoring and backup.
- ElastiCache: Hosted Redis clusters.
- Elasticsearch Service: Hosted Elasticsearch clusters. Used to power search operations on the Curefit app as well as internal dashboards.
- Elastic Compute Cloud: Hosted VMs. Our Kubernetes cluster is hosted on this.
- CloudWatch: Used for monitoring all applications and related infrastructure in the form of logs, metrics and events.
- AWS MediaLive: Processing and encoding video streams. Enables high availability of video streaming.
- Simple Storage Service: Storing all assets like images, documents, videos.
- Redshift: Integrated with our source databases to power data warehousing.
- Simple Email Service: Our internal communication service integrates with SES to power all email communication.
- CloudFront: CDN that integrates with S3 powers the quick delivery of content across multiple zones.
- Simple Notification Service: Publisher Subscriber messaging service crucial to integration across our microservices.
- Simple Queue Service: Hosted queue service that enables decoupling of logical components.
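To make the publish/subscribe and queue roles in the list above concrete, here is a small in-memory sketch of a topic fanning out to queues. It shows one common way of combining the two patterns and is not necessarily how the services are wired. `InMemoryTopic`, `InMemoryQueue` and `OrderPlaced` are hypothetical names and are not AWS SDK types.

```typescript
// In-memory stand-ins for a pub/sub topic and a message queue.
// InMemoryTopic, InMemoryQueue and OrderPlaced are hypothetical names;
// none of this is the AWS SDK.

class InMemoryQueue<T> {
  private readonly messages: T[] = [];

  enqueue(message: T): void {
    this.messages.push(message);
  }

  // Returns the oldest message, or undefined when the queue is empty.
  dequeue(): T | undefined {
    return this.messages.shift();
  }

  get pending(): number {
    return this.messages.length;
  }
}

class InMemoryTopic<T> {
  private readonly subscribers: InMemoryQueue<T>[] = [];

  subscribe(queue: InMemoryQueue<T>): void {
    this.subscribers.push(queue);
  }

  // Every subscribed queue receives every published message, so the
  // publisher does not need to know who consumes it.
  publish(message: T): void {
    for (const queue of this.subscribers) {
      queue.enqueue(message);
    }
  }
}

interface OrderPlaced {
  orderId: string;
  vertical: string;
}

const orderPlaced = new InMemoryTopic<OrderPlaced>();
const notificationQueue = new InMemoryQueue<OrderPlaced>();
const dispatchQueue = new InMemoryQueue<OrderPlaced>();
orderPlaced.subscribe(notificationQueue);
orderPlaced.subscribe(dispatchQueue);
orderPlaced.publish({ orderId: "o-1", vertical: "eat.fit" });
```

Each consumer drains its own queue at its own pace, which is the decoupling that a queue between services provides.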
- Kubernetes (Read more here) – Most of our microservices are deployed on a Kubernetes cluster which itself is spawned over spot instances. This gives us multiple advantages
- Faster startup and teardown times
- Insulation from future cloud provider changes
- Uniform way of deploying dev, stage, load testing and prod environments
- Better utilization of compute resources
- And a lot more (covered in our previous blog post on our k8s setup)
Frontend – Websites and apps (Internal and external)
- React.js – At the time we started Curefit, it was much easier and more intuitive to write code in React than in Angular etc. It was also much more performant. We use React extensively for our customer website and 30+ internal dashboards. We also use Redux for state management. Looking back, we are pretty happy with this choice and believe it will stand the test of time.
- React-crux – react-crux was an internal frontend library for backend developers to create simple admin dashboards quickly. We open-sourced it and really believe it could be of immense value to any startup. Check out our previous blog on react-crux for a detailed explanation.
- React Native – Most of our early engineers were from well-respected companies which had built solid mobile apps in the past, and one thing that we did not want to do was have separate teams for Android and iOS. Companies that do that are perennially stuck in feature parity sprints and problem-solving. Four years ago, React Native was the only real technology available that let us write cross-platform code which was performant and also fully supported by Apple and Google.
React Native is still in its early years though, and as a philosophy we stay away from its latest version and always stick to the second-latest version. Good app engineers are super rare in India, and choosing a technology that helps us develop apps for two platforms with half the team is amazing.
Analytics and Data Warehousing
- Teams at Curefit reach decision points daily, and having visibility into the data almost always helps them make the right call. For an analytics solution to keep up with continuously evolving needs, we had a few core objectives while choosing an approach: ease of KPI definition and computation, near real-time data ingestion, a self-serve query system, connectivity to BI tools, etc. We have already written about our Analytics Stack, the challenges we faced and more. To summarize, once we read and transform data from all the different types of data sources, we push it to Redshift. We also keep recent data in Postgres to serve near real-time demands. We currently use Metabase to help our analysts play around with data. With an ever-increasing set of use cases, data sizes, performance expectations, flexibility needs, etc, we are working on replacing some of the processes and tools that we currently use.
Other third-party services and tools that we use
As mentioned earlier in the blog, not everything should be built in-house, and there are plenty of good third-party services and tools out there that can help you execute first. Here are some that Curefit uses heavily.
- Spot Instances: Spot instances on AWS EC2 are available at a considerably lower cost, lower than even reserved instances. The downside is that a spot instance can go away at any time. With Spot.io’s Elastigroup product, we configure the distribution of instance types, regions, etc that we need, and it tries to stick close to that. It also predicts and optimizes the instance types it holds on to. When a machine is about to go down, our k8s cluster is made aware of it well in advance, which helps bring up new boxes in time to take up the load. Spot also analyses load patterns and prescales, which gives our customers a better experience during peaky load.
- Backend Application Monitoring: With an increasing number of internal services and ever-evolving dependencies between them, it’s really important to have complete visibility into how a service is behaving. Some of the critical aspects to be monitored are throughput, response times, error rates, and time spent in databases / external services. We chose New Relic as it offered all these and more, and was also compatible with the languages being used by our teams. We can also set up alerts on certain deteriorating conditions, which give us a heads-up before things get worse. It has helped us visualize the impact of any tuning done at the infrastructure level as well as of performance fixes when they are released. We did end up sampling services to contain costs as the number of boxes increased. We are evaluating Datadog, as its pricing model is better at the time of writing and New Relic’s pricing does not work well for companies running most of their services on k8s.
- Error Monitoring: We required a service which would give us real-time insight into errors, their root causes and their occurrence rates. This helps us assess severity and impact and triage quickly, resulting in better turnaround times for fixes. We used Rollbar to achieve this in the languages we use for our services. Rollbar also has a nice feature which allows you to write custom rules (fingerprints) to club different error cases into one, which helps identify actual occurrence rates. Rollbar also has lots of integrations, like Slack, Jira and OpsGenie, which facilitate adequate responses. We are also evaluating Sentry as an alternative to Rollbar; it seems to have a similar feature set at a much lower cost. Sentry works on mobile apps and the web as well.
- Business Metrics Monitoring: Every business typically has multiple metrics which need to be monitored and visualized in real-time. We talked about application and error monitoring earlier. All our services push various multi-dimensional metrics at certain points in the workflows. These metrics help visualize the number of visits, orders, payments, sessions started/completed, dispatches, delays/lag, deliveries etc. We used Prometheus, which is backed by a time-series database, to collect the metrics. It has built-in integration with Grafana, which helps us visualize and query these metrics and set up alerts on them. Grafana also enables the creation of very specific dashboards that give an overview of the health/state of the system. The integrations on top of Grafana alerts help notify the relevant teams on priority.
- Firebase: We use Firebase Cloud Messaging to send pushes from our server to the apps our customers and partners use. We also use their real-time database to power our live tracking feature for deliveries. Firebase, being a NoSQL document-based store, makes it easy to evolve the schema over time. We also use Crashlytics (now part of the Firebase suite) heavily to track client-side app issues and crashes.
- Graphhopper + jsprit: To fulfil the orders placed by our customers (Eat Fit deliveries), a lot goes on behind the scenes to plan the first-mile and last-mile workflows. The physical movement of goods/products needs route planning and estimates of travel times at different times of the day. We hosted augmented versions of Graphhopper to get travel time estimates between locations and jsprit to come up with possible plans for a capacitated vehicle routing problem with time windows.
- Code Search: We use GitHub as our source code repository. We frequently need to search by regular expression across repositories. As of writing this article, GitHub does not support this. We use Hound, which indexes our codebase and provides blazingly fast code search with a query language richer than that offered by GitHub.
- Jenkins: Jenkins is an open-source automation server. We use it to create packages for deployment, build library packages and push them to a repo, run scheduled jobs, etc. Jenkins allows us to scale worker nodes, which helps us keep turnaround times for jobs in check.
- Spinnaker: We use Spinnaker to deploy changes to all our Kubernetes clusters across all environments. Spinnaker Pipelines allow us to customize the set of steps to be performed, which can be different for each environment. These range from functions that manipulate infrastructure (deploy, resize, disable) to utility scaffolding functions (manual judgment, wait, run Jenkins job) that together precisely define the runbook for managing our deployments. It also integrates with Jenkins out of the box, so it is notified when a build is ready.
- CloudFront + Cloudinary: Our app/website has lots of static content like banners, promo videos, icons etc. CloudFront is our CDN for videos and images and serves all our static content. The original source content is uploaded to S3 via our internal dashboards. Videos are read directly from S3 and cached by CloudFront. For image optimization, we use Cloudinary as the source for the CDN. Cloudinary, in turn, adapts the raw image on S3 to the requested aspect ratio, resolution etc and returns it to CloudFront. We found the CloudFront cost structure to be more effective than that offered by Cloudinary.
- JIRA: All our issue tracking across various teams happens over JIRA. We create JIRA tickets based on certain flows, metrics, experience ratings etc. We use JIRA to maintain sprints for development efforts as well.
- Freshdesk: We use Freshdesk primarily for our customer support team. FAQs across different flows are also maintained in Freshdesk by our product teams. We have also built internal integrations with Freshdesk to show ticket-level data on different dashboards and to power some parts of our knowledge base on the app/website.
- NpmJS, GitHub: Our TypeScript-based projects use npm as the package manager, and we picked npmjs as our package repository to support these projects. It works quite well with npm and yarn. We are currently moving away from npmjs to GitHub Packages, primarily because the cost structure works better for us. Also, our codebase is already on GitHub and we are planning to move more of our CI pipelines to GitHub.
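As an aside on the multi-dimensional metrics mentioned under Business Metrics Monitoring above, the sketch below shows what a labelled counter looks like when rendered in the Prometheus text exposition format. It is an illustrative stand-in only: `LabelledCounter` is a hypothetical class and not the API of any Prometheus client library, and the metric and label names are made up.

```typescript
// Minimal labelled counter rendered in the Prometheus text format.
// LabelledCounter is a hypothetical stand-in, not a client library API.
// Label values are assumed to need no escaping (no quotes, backslashes
// or newlines).

type Labels = Record<string, string>;

class LabelledCounter {
  private readonly metricName: string;
  private readonly values = new Map<string, number>();

  constructor(metricName: string) {
    this.metricName = metricName;
  }

  // Serialise labels in sorted key order so that the same label set
  // always maps to the same time series.
  private static seriesKey(labels: Labels): string {
    return Object.keys(labels)
      .sort()
      .map((k) => `${k}="${labels[k]}"`)
      .join(",");
  }

  inc(labels: Labels, by = 1): void {
    const key = LabelledCounter.seriesKey(labels);
    this.values.set(key, (this.values.get(key) ?? 0) + by);
  }

  // One line per label combination, in first-seen order.
  render(): string[] {
    return Array.from(
      this.values,
      ([key, value]) => `${this.metricName}{${key}} ${value}`,
    );
  }
}

const ordersTotal = new LabelledCounter("orders_total");
ordersTotal.inc({ vertical: "eat", status: "placed" });
ordersTotal.inc({ status: "placed", vertical: "eat" }); // same series, different key order
ordersTotal.inc({ vertical: "cult", status: "placed" });
const exposition = ordersTotal.render();
// orders_total{status="placed",vertical="eat"} 2
// orders_total{status="placed",vertical="cult"} 1
```

Each distinct label combination becomes its own time series, which is what lets a dashboard slice one metric by vertical, status and so on.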
- Slam dunk decisions
- Having an explicit philosophy behind the tech stack helps us to this day. The concept of having just two options for all parts of the tech stack has worked out well for us. There are days when we think it would have been better to have just one option, but having 2 options also helps developers learn a new tech stack from time to time.
- Choosing a cloud provider from day one. This led to a massive boost to developer productivity.
- Some things that we are now changing
- Really strict cost attribution and accountability on infrastructure cost. Tech infra and related costs go hand in hand. We used tech infrastructure as a common cost centre for the whole company and it worked reasonably well for the first 3 years. We are currently in the middle of a big project to attribute every dollar spent to a specific project. This cost (along with utilization numbers) also shows up in the PnL of that particular business, giving pinpoint profitability numbers. In hindsight, we should probably have done this a couple of years earlier.
- Choosing Java-based servers for high-throughput services. Through rigorous load testing, we are now leaning towards writing services that handle high API throughput in Java. Writing services in Node.js has its advantages (and we will continue to do so for all our admin dashboards etc) but high performance is not one of them. We will write more about this in another blog 🙂
While we continue to evolve our tech stack philosophy and the tech stack itself, we felt that we have learnt quite a bit in our 4+ years of journey at Curefit. This blog is an attempt to share that information with more budding startups in an ecosystem where tech bandwidth is scarce and costly. To keep the blog concise, we wrote it in a “breadth-first” way, to cover all parts of a typical tech stack.
For more details and feedback, please leave your comments in the section below 🙂
Authors – Abhilash L L, Pravesh Jain, Ankit Gupta