True ML Talks #9 - Machine Learning Platform @ DoorDash

TrueFoundry with Head of ML Platform at DoorDash

We are back with another episode of True ML Talks. In this episode, we dive deep into DoorDash's ML Platform with Hien Luu.

Hien Luu is a Senior Engineering Manager at DoorDash, where he leads the build-out of DoorDash's ML platform. DoorDash is one of the biggest food delivery companies in the US, valued at more than $25 billion.

📌
Our conversation with Hien Luu covers the following topics:
- ML Use Cases at DoorDash
- Designing a Scalable Model Serving Layer
- Shadowing Models: Accelerating Testing and Deployment
- Standardization via gRPC
- Streamlining Feature Engineering and Data Formats
- The Importance of Model Validation and Automated Retraining
- Challenges and Opportunities for MLOps in Supporting Generative AI and LLMs

Watch the full episode below:

Use Cases of ML @ DoorDash

  1. Efficient Order Assignment and Delivery: ML algorithms play a pivotal role in predicting order preparation time, estimating delivery time, and routing Dashers for optimal efficiency. Leveraging historical data, such as restaurant cooking times, traffic patterns, and weather conditions, DoorDash dynamically assigns orders to Dashers, ensuring faster deliveries and a seamless customer experience.
  2. Personalized Search Recommendations: ML-powered search recommendations have become a standard feature for online platforms, including DoorDash. By analyzing customer preferences, order history, and contextual data, DoorDash employs ML algorithms to suggest relevant restaurants, cuisines, and dishes to users. This personalized approach enhances the user experience, encourages exploration, and boosts customer satisfaction.
  3. Targeted Ads and Promotions: DoorDash leverages ML to deliver targeted advertisements and promotions that align with user preferences. By analyzing user behavior, transaction history, and demographic data, DoorDash tailors its marketing campaigns to specific customer segments. This targeted approach increases the effectiveness of promotions, fosters customer loyalty, and drives engagement.
  4. Proactive Fraud Detection: To combat fraud, DoorDash utilizes ML algorithms to detect and mitigate fraudulent activities, including fake orders, account hijacking, and payment fraud. By analyzing patterns, anomalies, and historical data, DoorDash proactively identifies fraudulent behavior, safeguarding customers and maintaining the platform's integrity.
  5. Menu Item Classification: Onboarding a vast number of merchants with diverse menus poses a unique challenge for DoorDash. ML algorithms are employed to automatically detect and classify menu items accurately. By processing images, text descriptions, and customer feedback, DoorDash seamlessly integrates merchant menus into its platform, providing customers with a rich and consistent browsing experience.

Designing a Scalable Model Serving Layer

The scalable model serving layer built by DoorDash's MLOps team is a crucial component of their machine learning infrastructure and supports billions of predictions every day. The following are some insights into the architecture and key decisions that enabled its growth.

  1. Focused library support: DoorDash's model serving layer was designed to support two key libraries: LightGBM and PyTorch. This decision allowed the MLOps team to build optimized solutions for these libraries, ensuring efficient model serving.
  2. Batch prediction support: To reduce network call overhead, the model serving layer was designed to support batch prediction. This is particularly beneficial for use cases like recommendation systems, which generate thousands of rankings for a single user. By processing predictions in batches, the system achieves better performance and scalability.
  3. Model shadowing for testing: The model serving platform incorporates a feature called model shadowing, enabling data scientists to test their models in production without affecting live user traffic. This shadow mode helps them gain confidence in the model's performance and behavior before promoting it to full production, ensuring a smooth and error-free deployment process.
  4. Microservice Architecture: The model serving platform at DoorDash follows a microservice architecture. Leveraging Kubernetes, the platform organizes models into isolated pods, enabling independent scaling based on individual needs. This architectural approach promotes modularity, scalability, and efficient resource allocation, aligning with industry best practices for building microservices.
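
The batch prediction point above can be illustrated with a minimal sketch. It assumes a hypothetical `predict_batch` callable standing in for one network call to the serving layer; the names, the batch size, and the fake model are illustrative assumptions, not DoorDash's actual API.

```python
from typing import Callable, List, Sequence

FeatureVector = Sequence[float]


def predict_in_batches(
    predict_batch: Callable[[List[FeatureVector]], List[float]],
    candidates: List[FeatureVector],
    batch_size: int = 256,
) -> List[float]:
    """Score all candidates with one call per batch instead of one call per item."""
    scores: List[float] = []
    for start in range(0, len(candidates), batch_size):
        batch = list(candidates[start:start + batch_size])
        scores.extend(predict_batch(batch))
    return scores


# Hypothetical stand-in for a remote model; it records the size of each call.
call_sizes: List[int] = []


def fake_remote_model(batch: List[FeatureVector]) -> List[float]:
    call_sizes.append(len(batch))
    return [sum(features) for features in batch]


# Rank 1,000 candidates for a single user.
candidates = [[float(i), 1.0] for i in range(1000)]
scores = predict_in_batches(fake_remote_model, candidates, batch_size=256)
```

With these assumed numbers, 1,000 candidates are scored in 4 calls rather than 1,000, which is where the reduction in network overhead comes from.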
DoorDash’s ML Platform - The Beginning - DoorDash Engineering Blog
DoorDash uses Machine Learning (ML) at various places like inputs to Dasher Assignment Optimization, balancing Supply & Demand, Fraud prediction, Search Ranking, Menu classification, Recommendations etc. As the usage of ML models increased, there grew a need for a holistic ML Platform to increase th…

Shadowing Models: Accelerating Testing and Deployment

Implementing a shadowing layer within DoorDash's model serving infrastructure has significantly increased the speed at which models are tested and deployed. This section covers how the shadowing layer works, how it differs from canary testing, and how it makes model testing more efficient for data scientists.

Streamlined Shadowing Process

DoorDash's shadowing layer simplifies the process, ensuring that data scientists can effortlessly conduct model tests. The implementation is both straightforward and powerful. Data scientists utilize configurations and an intuitive tool to specify a primary model and shadow models. With just a few clicks, they can allocate a desired percentage of incoming traffic (e.g., 1% or 2%) to be routed to the shadow models. The platform handles the rest, including loading the designated model into the appropriate pods, seamlessly routing the specified traffic, and logging predictions for the shadow models.
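
As a rough sketch of the process described above, the following shows a configuration naming a primary model and shadow models with traffic fractions, plus a router that always answers from the primary and only logs shadow predictions. The class names, model names, and in-memory log are assumptions for illustration; the source does not describe DoorDash's actual configuration format.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Model = Callable[[List[float]], float]


@dataclass
class ShadowConfig:
    primary: str
    # Shadow model name -> fraction of traffic it should see (0.02 means 2%).
    shadows: Dict[str, float] = field(default_factory=dict)


class ShadowRouter:
    def __init__(self, models: Dict[str, Model], config: ShadowConfig, seed: int = 0):
        self.models = models
        self.config = config
        self.rng = random.Random(seed)
        self.shadow_log: List[Tuple[str, List[float], float]] = []

    def predict(self, features: List[float]) -> float:
        # The caller always receives the primary model's prediction.
        result = self.models[self.config.primary](features)
        for name, fraction in self.config.shadows.items():
            if self.rng.random() < fraction:
                # Shadow predictions are logged for later analysis, never returned.
                self.shadow_log.append((name, features, self.models[name](features)))
        return result


# Hypothetical models: the shadow candidate predicts 10% higher than the primary.
models: Dict[str, Model] = {
    "eta_v1": lambda f: sum(f),
    "eta_v2": lambda f: sum(f) * 1.1,
}
router = ShadowRouter(models, ShadowConfig(primary="eta_v1", shadows={"eta_v2": 0.02}))
responses = [router.predict([1.0, 2.0]) for _ in range(10_000)]
```

Every response comes from `eta_v1`, while roughly 2% of requests also produce a logged `eta_v2` prediction that can be compared offline.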

Accelerating Velocity and Empowering Data Scientists

The simplicity and user-friendliness of DoorDash's shadowing layer have dramatically expedited the pace of testing and deployment for data scientists. By eliminating unnecessary complexities and minimizing reliance on engineering support, data scientists enjoy full autonomy over the shadowing process. This newfound agility empowers them to iterate on their models more frequently, resulting in an accelerated development cycle and fostering rapid innovation.

However, as the number of models and the traffic volume increase, it is essential to address considerations such as the scalability of the logging system and cost management. Striking a balance between efficient operations and the expanding scope of model testing remains crucial for sustaining the benefits of the shadowing layer.

Ship to Production, Darkly: Moving Fast, Staying Safe with ML Deployments
Learn how DoorDash balanced ML models’ release speed and reliability by shipping darkly in order to manage fraud model deployments

Distinguishing Shadowing from Canary Testing

  1. Shadowing: In MLOps, shadowing refers to testing and evaluating models within a production environment without affecting live user traffic. A portion of incoming requests is also sent to the shadow models, whose predictions are logged rather than returned to users, while the primary model continues to serve every response. This gives data scientists a safe way to gain confidence in a model's performance and behavior before deploying it fully.
  2. Canary: Canary testing, on the other hand, involves gradually rolling out new models to a subset of users to evaluate their performance and stability compared to the existing model. It helps identify any issues or discrepancies before deploying the new model to the entire user base. Canary testing allows for a controlled evaluation of the new model's impact on user experiences, enabling data-driven decisions regarding its adoption.
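
To make the contrast concrete, here is a minimal canary routing sketch. Unlike shadowing, users in the canary group actually receive the candidate model's output. The hashing scheme, bucket count, and function names are assumptions chosen for illustration, not a description of any particular platform.

```python
import hashlib
from typing import Callable, List

Model = Callable[[List[float]], str]


def in_canary(user_id: str, canary_percent: float) -> bool:
    """Deterministically place a user in the canary group using a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < canary_percent * 100


def serve(
    user_id: str,
    features: List[float],
    current_model: Model,
    candidate_model: Model,
    canary_percent: float,
) -> str:
    # Canary users see the candidate's prediction; everyone else sees the current model's.
    model = candidate_model if in_canary(user_id, canary_percent) else current_model
    return model(features)


# Roll the candidate out to about 5% of users.
canary_users = sum(in_canary(f"user-{i}", 5.0) for i in range(10_000))
```

Hashing the user ID keeps each user in the same group across requests, so the rollout percentage can be raised gradually without users flipping between models.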
📌
Standardized on gRPC
DoorDash adopted gRPC as the standard protocol across the company. This choice was driven by the need for stability and efficiency at scale. The binary protocol of gRPC, along with its battle-tested nature, appealed to DoorDash's focus on optimizing every aspect of their ML infrastructure. The decision to use gRPC for service-to-service communication ensured reliable and efficient interactions between components of the model serving layer.
"We all believe that when you do things at scale, every little thing matters. I think the binary protocol is good for that when you start operating at scale, and gRPC has been battle-tested at many, many companies."
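
The point about binary protocols can be illustrated with a small size comparison. gRPC normally carries Protocol Buffers messages over HTTP/2; the sketch below does not use the Protocol Buffers wire format. It only uses Python's standard `struct` module as a stand-in to show why a binary encoding of numeric features tends to be more compact than JSON text. The payload layout is an assumption made for this example.

```python
import json
import struct

# 100 feature values that are exactly representable as 32-bit floats.
features = [0.125 * i for i in range(100)]

# Text encoding: JSON, as used by many REST APIs.
json_payload = json.dumps({"features": features}).encode("utf-8")

# Binary encoding: a 4-byte count followed by little-endian 32-bit floats.
binary_payload = struct.pack(f"<I{len(features)}f", len(features), *features)

# Decode the binary payload back into a list of floats.
(count,) = struct.unpack_from("<I", binary_payload, 0)
decoded = list(struct.unpack_from(f"<{count}f", binary_payload, 4))

print(len(json_payload), len(binary_payload))
```

Part of the saving here comes from using 32-bit floats, so the comparison is indicative only; real savings depend on the message schema and the data.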

Streamlining Feature Engineering and Data Formats

In order to facilitate feature engineering and model training, DoorDash focused on optimizing its infrastructure and data formats. Initially, the company utilized Snowflake as a data warehouse, which provided efficient data storage and management. However, as they scaled their model training operations, retrieving data from Snowflake proved to be inefficient. Recognizing the need for a data lake, Hien Luu advocated for its implementation, drawing from his experience at LinkedIn where a data lake had proven to be a valuable asset for numerous use cases. Building a data lake took time and effort, but once in place, DoorDash could leverage it to construct their feature engineering framework.

The feature engineering framework served as an abstraction layer, allowing data scientists to express how they wanted features to be computed. DoorDash's infrastructure then handled the computation, scheduling of pipelines, and resource management on behalf of the data scientists. Collaborating with the data lake team, optimal formats were determined for storing the computed features.
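
One way to picture such an abstraction layer is a declarative feature definition that the infrastructure interprets. The sketch below is a simplified assumption: `FeatureSpec`, `compute_feature`, and the column names are hypothetical, and a real framework would run this on a distributed engine with scheduling rather than in memory.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class FeatureSpec:
    name: str                                    # name of the resulting feature
    entity_key: str                              # column identifying the entity
    source_column: str                           # column to aggregate
    aggregation: Callable[[List[float]], float]  # how to combine the values


def compute_feature(spec: FeatureSpec, rows: Iterable[dict]) -> Dict[str, float]:
    """Group rows by entity and apply the declared aggregation."""
    grouped: Dict[str, List[float]] = {}
    for row in rows:
        grouped.setdefault(row[spec.entity_key], []).append(row[spec.source_column])
    return {entity: spec.aggregation(values) for entity, values in grouped.items()}


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# The data scientist declares what to compute; the platform decides how to run it.
avg_prep_time = FeatureSpec(
    name="store_avg_prep_minutes",
    entity_key="store_id",
    source_column="prep_minutes",
    aggregation=mean,
)

orders = [
    {"store_id": "s1", "prep_minutes": 10.0},
    {"store_id": "s1", "prep_minutes": 20.0},
    {"store_id": "s2", "prep_minutes": 8.0},
]
store_features = compute_feature(avg_prep_time, orders)
```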

In addition to the offline feature store, DoorDash also employed an online feature store. The majority of use cases involved online predictions integrated into production systems, necessitating the presence of an online feature store. Both offline and online feature stores were maintained, addressing the training and serving discrepancy commonly encountered in the industry. To synchronize the feature sets between the two stores, generated features were stored in the offline feature store and subsequently uploaded to the online feature store. By using the same logic for both offline and online scenarios, the feature engineering framework simplified the process. Data scientists could specify their desired features for both stores and rely on the infrastructure to handle the underlying mechanisms, such as scheduling the uploads.
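
The offline-to-online synchronization described above could look roughly like the following sketch. It assumes the offline store is an append-only history of feature values and the online store is a key-value map holding only the latest value; both structures and the `upload_latest` helper are hypothetical simplifications.

```python
from typing import Dict, List, Tuple

# Offline row: (entity_id, feature_name, timestamp, value)
OfflineRow = Tuple[str, str, int, float]
OnlineKey = Tuple[str, str]


def upload_latest(offline_rows: List[OfflineRow], online_store: Dict[OnlineKey, float]) -> int:
    """Copy the most recent value of each (entity, feature) pair into the online store."""
    latest: Dict[OnlineKey, Tuple[int, float]] = {}
    for entity_id, feature_name, timestamp, value in offline_rows:
        key = (entity_id, feature_name)
        if key not in latest or timestamp > latest[key][0]:
            latest[key] = (timestamp, value)
    for key, (_, value) in latest.items():
        online_store[key] = value
    return len(latest)


offline_rows: List[OfflineRow] = [
    ("s1", "store_avg_prep_minutes", 100, 15.0),
    ("s1", "store_avg_prep_minutes", 200, 12.5),
    ("s2", "store_avg_prep_minutes", 150, 8.0),
]
online_store: Dict[OnlineKey, float] = {}
uploaded = upload_latest(offline_rows, online_store)
```

Because training reads the offline history and serving reads values uploaded from that same history, both paths see features produced by the same logic.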

Five Common Data Quality Gotchas in Machine Learning and How to Detect Them Quickly - DoorDash Engineering Blog
Data preparation represents the vast majority of work in developing machine learning models. Learn how to make things easier.

The Importance of Model Validation and Automated Retraining in MLOps

Validating Model Performance

Ensuring the accuracy and reliability of machine learning models is a critical aspect of the MLOps process. Model validation involves testing the performance of a model using real-world data to verify its effectiveness. By automating this validation process using tools like MLflow, data scientists can track experiments, compare results, and evaluate different models based on their performance metrics. Model validation provides confidence in the model's ability to make accurate predictions and informs decision-making in the deployment process.
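
A validation step of this kind can be reduced to a simple gate: a candidate model is promoted only if it beats the current baseline on held-out data. The sketch below uses plain Python rather than MLflow, and the metric, threshold, and data are assumptions for illustration.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def passes_validation(
    candidate_preds: Sequence[int],
    baseline_preds: Sequence[int],
    labels: Sequence[int],
    min_gain: float = 0.0,
) -> bool:
    """Accept the candidate only if it beats the baseline by more than min_gain."""
    gain = accuracy(candidate_preds, labels) - accuracy(baseline_preds, labels)
    return gain > min_gain


# Hypothetical held-out labels and predictions from two models.
labels = [1, 0, 1, 1, 0, 1, 0, 0]
baseline_preds = [1, 0, 0, 1, 0, 0, 0, 1]
candidate_preds = [1, 0, 1, 1, 0, 0, 0, 0]

baseline_acc = accuracy(baseline_preds, labels)
candidate_acc = accuracy(candidate_preds, labels)
promote = passes_validation(candidate_preds, baseline_preds, labels, min_gain=0.1)
```

In practice the metrics for each run would be recorded in an experiment tracker so that comparisons like this one are reproducible.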

Automated Retraining for Optimal Performance

Automated retraining takes model validation a step further by enabling models to be automatically retrained based on predefined criteria or thresholds. This proactive approach ensures that models stay up-to-date and continue to perform optimally over time. By minimizing manual intervention, MLOps teams can reduce the risk of human error and streamline the retraining process.

Implementing automated retraining requires careful consideration of each model's specific needs and potential consequences. MLOps teams must design and implement safeguards and flexible processes to ensure that models are retrained appropriately. This involves planning and testing to determine the optimal retraining frequency, criteria for retraining, and strategies for promoting the retrained models to production.
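
The retraining criteria mentioned above can be expressed as a small per-model policy. The following sketch assumes two hypothetical triggers, model age and metric degradation since deployment; the thresholds are placeholders that each team would need to choose and test for its own models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrainPolicy:
    max_age_days: int        # retrain once the model is at least this old
    max_metric_drop: float   # allowed drop relative to the metric at deployment


def should_retrain(
    policy: RetrainPolicy,
    model_age_days: int,
    deployed_metric: float,
    current_metric: float,
) -> bool:
    """Return True when either the age or the degradation criterion is met."""
    too_old = model_age_days >= policy.max_age_days
    degraded = (deployed_metric - current_metric) > policy.max_metric_drop
    return too_old or degraded


policy = RetrainPolicy(max_age_days=30, max_metric_drop=0.05)

fresh_and_healthy = should_retrain(policy, model_age_days=5, deployed_metric=0.90, current_metric=0.88)
fresh_but_degraded = should_retrain(policy, model_age_days=5, deployed_metric=0.90, current_metric=0.84)
old_but_healthy = should_retrain(policy, model_age_days=45, deployed_metric=0.90, current_metric=0.90)
```

A trigger like this only decides when to retrain; promoting the retrained model should still pass through validation and safeguards before it reaches production.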

The benefits of automated retraining are substantial. By continuously updating models, organizations can maintain their accuracy and reliability, adapt to evolving data patterns, and address potential performance degradation. Automated retraining also reduces the risk of errors and downtime in production environments, as models are proactively improved and updated.

Incorporating model validation and automated retraining into the MLOps infrastructure is crucial for building robust and reliable machine learning systems. By leveraging automation tools and implementing well-designed processes, organizations can ensure that their models deliver accurate predictions consistently and adapt to changing conditions effectively.

Challenges and Opportunities for MLOps in Supporting Generative AI and LLMs

Generative AI and large language models (LLMs) have the potential to revolutionize many industries, including food delivery. However, effectively leveraging these technologies requires MLOps teams to tackle several challenges and opportunities.

  1. Rapidly Evolving Space: Keeping up with the fast-paced advancements in generative AI and LLMs poses a challenge for data scientists and MLOps teams.
  2. Focus on Supporting Data Scientists: The focus should be on determining the necessary infrastructure and tools to support data scientists in effectively utilizing LLMs for their specific use cases.
  3. Prompt Engineering: Infrastructure teams can play a crucial role in assisting data scientists with prompt engineering, helping them optimize and fine-tune prompts for desired outputs.
  4. Internal Hosting for Privacy and Latency: Some use cases may require hosting LLMs internally to address privacy concerns, reduce latency, or control costs. Understanding how to set up internal hosting and work with GPU configurations becomes essential.
  5. Investing in Infrastructure: Recognizing the potential of LLMs and generative AI, companies like DoorDash are investing in the necessary infrastructure to support diverse use cases and empower data scientists.
  6. Leveraging OpenAI and Internal Models: Different use cases may require different hosting approaches. Some can leverage OpenAI models, while others may necessitate internally hosted models based on factors like latency, dataset, scale, and cost.
  7. Resource Management and Scalability: Effectively managing resources and addressing the challenge of model quantization are key considerations for hosting LLMs in a scalable manner.
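
To see why quantization matters for internal hosting, a back-of-the-envelope estimate of weight memory helps. The sketch below assumes a hypothetical 7-billion-parameter model and counts only the weights; activations, the KV cache, and framework overhead add more memory on top.

```python
def weight_memory_gb(num_parameters: float, bits_per_parameter: int) -> float:
    """Approximate memory for the model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9


# Hypothetical 7B-parameter model at three common precisions.
estimates = {bits: weight_memory_gb(7e9, bits) for bits in (16, 8, 4)}

for bits, gigabytes in estimates.items():
    print(f"{bits}-bit weights: {gigabytes:.1f} GB")
```

Halving the bits per parameter halves the weight memory, which can determine whether a model fits on a single GPU or needs several.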

Here is another interesting blog written by the team at DoorDash around Generative AI:

DoorDash identifies Five big areas for using Generative AI - DoorDash Engineering Blog
Discover how DoorDash plans to revolutionize the delivery experience with Generative AI and enhance the customer’s ordering journey.

Read our previous blogs in the True ML Talks series:

True ML Talks #8 - Machine Learning Platform @ Intuit
In this blog, we dive deep into Intuit’s Machine Learning Platform, and Numaflow. Understand Intuit’s ML architecture, how ML is used at Intuit.

Keep watching the TrueML YouTube series and reading the TrueML blog series.


TrueFoundry is an ML Deployment PaaS over Kubernetes that speeds up developer workflows, giving developers full flexibility in testing and deploying models while ensuring full security and control for the Infra team. Through our platform, we enable machine learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds, allowing them to save costs and release models to production faster, enabling real business value realisation.