Dark Launch is the best Light Launch
Easter eggs everywhere
In my last job, we built product recommender systems for e-commerce companies, which meant our APIs were live on every page of their websites. We signed a new customer, our first seven-figure deal, and we were so cautious that we initially onboarded them in March with rule-based recommendations. We didn’t want to risk a bad user experience with our nascent machine learning models.
Later in April, we built machine learning models and performed extensive offline testing and a lot of manual QA. Finally, we felt confident that our model would perform well, so we launched, and two things happened:
- Emergency call: Within a few minutes we got an emergency phone call from them, while we were already puzzling over the recommendation results and trying to debug them. We had Easter egg recommendations on almost every page of their website. It turned out that we had misconfigured the model id, pointing it at the model that we had trained in April on their Easter shopping data! We immediately switched the model id to the new one and that problem was fixed.
- Increase in p90 latency: It turned out that we were using product descriptions as features in our model, and some of the descriptions were very long, which increased our feature computation time. This was hard to detect during offline testing because the model latency looked fine in most cases. There was no good immediate fix, so we had to revert to the rule-based system until we fixed our featurization and re-tested the model.
Overall, this resulted in a lot of fire-fighting, a big loss of credibility, and a customer we almost lost. In our internal retrospective later, we realized that while the first issue was just a manual miss, issues like the second were almost impossible to detect offline. Since then, we have moved to the bright side of doing dark launches!
What is dark launch?
Dark launch is a deployment strategy in which you replay your actual production traffic to your newly deployed service and discard its response instead of returning it to the user. The new service behaves as if it were live but does not affect users at all. This allows you to verify that your new service does not have any errors, has comparable or better performance compared to your old service, and can handle the production load. Once all of this is verified, it’s almost trivial to actually switch to your new service incrementally. So in a way,
Dark launch is a light way to launch your services.
with very minimal downside and huge potential upside.
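To make the definition concrete, here is a minimal sketch of the idea in Python. All names in it (`make_dark_launch_handler`, `rule_based`, `new_model`, `shadow_log`) are hypothetical stand-ins, not part of any real system. It shows the core contract only: the user always gets the current service's response, while the new service sees the same request and its output (or its failure) is recorded and then thrown away.

```python
from typing import Callable


def make_dark_launch_handler(
    live: Callable[[dict], dict],
    shadow: Callable[[dict], dict],
    shadow_log: list,
) -> Callable[[dict], dict]:
    """Wrap `live` so that every request is also replayed to `shadow`."""

    def handle(request: dict) -> dict:
        response = live(request)  # the only response the user ever sees
        try:
            # The shadow response is recorded for later comparison, never returned.
            shadow_log.append({"request": request, "shadow_response": shadow(request)})
        except Exception as exc:
            # A failing shadow service must never break the user-facing path.
            shadow_log.append({"request": request, "shadow_error": repr(exc)})
        return response

    return handle


def rule_based(request: dict) -> dict:
    # Hypothetical stand-in for the current production service.
    return {"items": ["bestseller-1", "bestseller-2"]}


def new_model(request: dict) -> dict:
    # Hypothetical stand-in for the dark-launched service; here it is broken.
    raise RuntimeError("model misconfigured")


log: list = []
handler = make_dark_launch_handler(rule_based, new_model, log)
result = handler({"page": "home"})
```

In this sketch the broken new service leaves an entry in `log`, while `result` is still the rule-based response. The shadow call here is synchronous for simplicity, so it would add to the user's latency; a real setup would not call the new service inline like this.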
How to dark launch?
Dark launching is one of the most realistic ways of testing your services and models on a production-like system. But executing a dark launch can require a lot of setup and maturity within the organization from a development, monitoring, and infrastructure standpoint.
- Adopt a micro-service architecture: It’s important to adopt an API-driven micro-service architecture to be able to incrementally test your new services. A dark launch works by replaying a copy of the production traffic to a new backend. This is easiest when the current production service and the new service are both available as micro-services and the communication happens over REST or gRPC calls.
- Traffic forking: Most often the traffic fork is done in the application frontend, where a copy of a percentage of the production traffic is sent to the new service.
- Async calls: A general principle of testing is that it should not adversely impact your actual production experience. When dark launching, you are effectively duplicating the production traffic, which can double your latency unless you make the calls to the new backend async. If your service is not latency-critical, then setting reasonable timeouts can also be a solution.
- Infrastructure: Ideally, your organization has built out infrastructure that is easy to scale horizontally because as you increase traffic percentage to your dark launch service, you incrementally need to scale your infrastructure as well. In most cases, it may even make sense to replicate the full peak traffic and beyond to make sure that your new service will truly scale.
- Logging: You will need to log the requests and responses of both your old and new backend services to be able to compare their responses and performance. If it’s a machine learning model, you want to make sure that your new model’s predictions are at least as good as your old model’s. This requires extensive logging.
- Monitoring: Dark launch is fairly useless if you don’t have good monitoring & instrumentation dashboards where you can compare the uptime, latency, scalability, and quality of response of your new service. Ideally, this should happen in real time so anomalies can be detected and surfaced quickly.
Is dark launch just fancy offline testing?
Offline testing allows you to check the behaviour of your system, usually in isolation. It rarely lets you test the end-to-end system together with the state of the surrounding systems, under realistic production traffic and network conditions. You can achieve roughly 70% of this through meticulous logging and very complicated offline testing, but a dark launch turns out to be a much simpler setup. This is because you end up doing most of the above steps anyway to launch and monitor a service normally. After a successful dark launch, the actual release of the new service is almost trivial, so the reward is well worth the effort.
There are a couple of cases where this can be hard to justify in practice. For example, if your service is stateful, or it actually writes to the database, then doing a dark launch is a lot more complicated. In my personal experience, ensuring the correctness of the system becomes so hard in those cases that it's almost better to settle for offline testing instead of a dark launch!
If you are more curious about dark launches or want to share some of your experience, please get in touch with me at nikunj@truefoundry.com!
TrueFoundry is an ML Deployment PaaS over Kubernetes that speeds up developer workflows, giving developers full flexibility in testing and deploying models while ensuring full security and control for the Infra team. Through our platform, we enable machine learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds - allowing them to save cost and release models to production faster, enabling real business value realisation.