True ML Talks #5 - Machine Learning Platform @ Simpl

TrueFoundry discusses Machine Learning Operations at Simpl, a fintech startup

We are back with another episode of True ML Talks. In this episode, we dive deep into Simpl's ML platform and speak with Sheekha.

Sheekha is the Director of Data Science at Simpl. Simpl is building India's leading 1-tap checkout network, providing merchants with a full set of products, from BNPL and installment payments to a range of other value-added services. The company works with more than 26,000 merchants across India, including Jio Platforms, the country's largest telecom network, and Zomato, one of its biggest food delivery services.

📌
Our conversation with Sheekha covers the following aspects:
- ML use cases in Simpl
- Overview of Simpl ML Infrastructure
- Managing Costs for ML Training
- Managing Training and Inference Pipelines Separately
- Automation in Retraining ML Models
- Simpl's Foray into building in-house
- Considerations for Real-time Systems and Data Science Models
- Making ML Deployment as Simple as Software
- Ingraining Engineering Principles in Data Science

Watch the full episode below:

ML Use Cases @ Simpl

  1. Fraud prevention and risk assessment: Simpl's ML system analyzes every transaction and uses simple rules, filters, machine learning models, and neural network systems to identify high-risk transactions such as account takeovers, identity theft, or other suspicious activity. By blocking fraudulent transactions, the system prevents monetary losses that would otherwise limit the company's ability to serve good customers.
  2. Underwriting: Simpl's ML system helps the company make underwriting decisions by analyzing onboarding data provided by users. The system determines how much credit a user is eligible for and what their spending limit should be. Simpl's teams are involved in the underwriting process and are moving towards more real-time pipelines and systems.
  3. Customer support: Simpl's ML system helps the company work with customers who have trouble paying on time. The system can remind customers about upcoming payments or offer alternative payment plans that work for both parties. Simpl's teams work with customers to find the best way forward, ensuring a positive customer experience.

We found this interesting news coverage of how Simpl is leveraging ML for fraud detection:

How Simpl is leveraging AI & ML to enhance fraud detection - Express Computer
With more than 20,000 merchant partners and 25 million users across India, Simpl is focused on ensuring seamless and safe transactions between users and merchants. Sheekha Verma, Director, Data Science, Simpl shares about the company’s robust anti-fraud infrastructure, including building data scienc…
Learn about AI advancements within Simpl

Data Science Team at Simpl

The data science team at Simpl consists of 28 data scientists and 16 data engineers. The team is a core part of Simpl along with other engineering teams, and the company has a separate DevOps team. The team works on ML, neural network systems, rules, graph databases, and graph machine learning models that look at communities of fraudulent users.

The Tech Stack and Workflow of Simpl's Data Science Team

From a current tech stack perspective, the company has everything on the cloud, with no on-prem systems in place.

The data science team at Simpl uses a remote machine with Python notebooks and libraries built by the data engineering team to connect to databases and perform exploratory data analysis (EDA). Once the analysis is done, the team sets up a pipeline with the help of the data engineering team to deploy the model. For batch models, the team uses Airflow for scheduling.

Model monitoring is done using Simpl's dashboards, which track changes in model output. Simpl is currently investing in MLOps. For anti-fraud systems, the company has a batch model that looks for similar email IDs and phone numbers. The team also has tools that monitor transactions in real time, based on transaction velocity and the amount being transacted.
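
A real-time velocity check of the kind described above can be sketched as a sliding-window counter per user. This is only an illustration under stated assumptions, not Simpl's implementation: the class name `VelocityRule`, the 60-second window, and both thresholds are hypothetical.

```python
from collections import defaultdict, deque


class VelocityRule:
    """Flags a user who transacts too often or too much within a time window."""

    def __init__(self, window_seconds=60.0, max_count=3, max_amount=5000.0):
        # All thresholds are illustrative assumptions, not values from Simpl.
        self.window_seconds = window_seconds
        self.max_count = max_count
        self.max_amount = max_amount
        self._events = defaultdict(deque)  # user_id -> deque of (timestamp, amount)

    def check(self, user_id, amount, timestamp):
        """Record the transaction and return True if it should be flagged."""
        events = self._events[user_id]
        # Drop events that have fallen out of the sliding window.
        while events and timestamp - events[0][0] > self.window_seconds:
            events.popleft()
        events.append((timestamp, amount))
        total = sum(a for _, a in events)
        return len(events) > self.max_count or total > self.max_amount


rule = VelocityRule(window_seconds=60.0, max_count=3, max_amount=5000.0)
# Four transactions within 30 seconds: only the fourth exceeds max_count.
flags = [rule.check("user-1", 100.0, t) for t in (0.0, 10.0, 20.0, 30.0)]
# A transaction much later starts from an empty window again.
late_flag = rule.check("user-1", 100.0, 200.0)
```

A rule like this keeps only recent events in memory, which is one reason such checks tend to be cheap enough to run on every transaction.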

Simpl has also deployed a neural network model for transaction monitoring. The model combines the current payload with historical data from the past year and decides whether to allow or decline the transaction. The data engineering team built a Flink pipeline to handle peak traffic and meet a latency SLA of 70-80 milliseconds.

📌
Feature Store:
A feature store is a centralized repository for storing and managing features, which are individual measurable properties or characteristics of data that are used to train machine learning models.

Simpl currently uses DynamoDB as a feature store for real-time availability. However, this is expensive, and there are efforts to build an internal feature store to bring down costs in the long term.
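
To make the role of a feature store concrete, here is a toy in-memory version that joins a live transaction payload with precomputed historical features. It is a sketch, not Simpl's design: in the setup described above DynamoDB plays the storage role, and the class `InMemoryFeatureStore`, the helper `build_model_input`, and the feature names are all hypothetical.

```python
import time


class InMemoryFeatureStore:
    """Toy key-value feature store: entity id -> {feature name: value}."""

    def __init__(self):
        self._rows = {}

    def put(self, entity_id, features):
        # Merge new feature values into the entity's existing row.
        self._rows.setdefault(entity_id, {}).update(features)

    def get(self, entity_id, names, default=0.0):
        # Missing features fall back to a default so the model input is complete.
        row = self._rows.get(entity_id, {})
        return {name: row.get(name, default) for name in names}


def build_model_input(store, payload, feature_names):
    """Combine the live payload with precomputed historical features."""
    historical = store.get(payload["user_id"], feature_names)
    return {"amount": payload["amount"], **historical}


store = InMemoryFeatureStore()
# Historical aggregates would normally be written by a batch or streaming job.
store.put("user-1", {"txn_count_365d": 42, "avg_amount_365d": 310.5})

start = time.perf_counter()
model_input = build_model_input(
    store,
    {"user_id": "user-1", "amount": 999.0},
    ["txn_count_365d", "avg_amount_365d", "chargebacks_365d"],
)
# Lookup time matters when the end-to-end budget is tens of milliseconds.
elapsed_ms = (time.perf_counter() - start) * 1000.0
```

Keeping the lookup behind a small `get`/`put` interface is one way to swap the backing store later without touching the model code.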

We found this interesting blog on how Data Science evolved at Simpl:

Journey of Data Sciences Lab @ Simpl
When we started in 2015, from our tiny office in Juhu (Mumbai), we pretty much had single person teams. We were seven people strong…

Managing Costs for ML Models: Challenges and Solutions

Managing the costs associated with implementing and scaling machine learning (ML) models is a critical challenge. It is especially important for models that require significant amounts of data and use expensive resources such as Flink pipelines and virtual machines.

The ML team deals with terabytes of data, which necessitates the use of virtual machines for training jobs. Balancing the costs against the benefits of the models is crucial.

To mitigate the costs, the team collaborates with the DevOps and data engineering teams to explore cost-effective options. They have also been working on building an internal feature store to reduce the cost of using DynamoDB. Another cost-saving measure is the use of spot instances for non-critical tasks.

However, managing costs is an ongoing process that requires continuous evaluation of each model's cost-effectiveness. Factors such as the precision-recall balance and the cost of impacting good users also come into play when deciding on the best cost-saving measure.

📌
Interaction between the ML and the DevOps Team:
Collaboration between DevOps and data science teams is necessary to provision virtual machines for machine learning projects, and there is typically a minimum of three days of lead time. The DevOps team receives multiple requests, including those from the data science team, which require consideration of cost and collaboration with the data engineering team to fulfill. In case of an urgent request, the DevOps team can expedite the provisioning process without considering the cost implications. The data science team accounts for the three-day time lag in the project deployment plan.

Managing Training and Inference Pipelines Separately: Pros and Cons

Managing the training and inference pipelines separately can cause problems that reduce the overall efficiency of the system. The main reason is that it becomes difficult to track where models came from, retain the code, and replicate results. It can also lead to human error and a mushrooming of problems, especially in startups.

On the other hand, managing these pipelines separately can provide greater flexibility and control over the system, enabling you to optimize each process independently. It can also allow you to scale the system more easily by adding new resources to the training or inference pipelines as needed.

Ideally, however, you would merge these pipelines and incorporate retraining into the same process. Doing so avoids the issues of managing the pipelines separately while still maintaining the flexibility and control that come with managing them independently. Overall, the decision to manage these pipelines separately or together depends on the specific needs of your organization and the resources available to you.
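
One lightweight way to address the traceability problem raised in this section is to store a small lineage record next to every trained model. The sketch below is an assumption about how that could look, not a description of Simpl's tooling: the function names, the record fields, and the sample values (`"fraud-nn"`, `"abc123"`) are hypothetical.

```python
import hashlib
import json


def fingerprint(obj):
    """Stable SHA-256 fingerprint of any JSON-serializable object."""
    encoded = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def make_model_record(model_name, code_version, params, training_rows):
    """Bundle what is needed to trace a model back to its training run."""
    return {
        "model_name": model_name,
        "code_version": code_version,  # for example, a git commit hash
        "params": params,
        "data_fingerprint": fingerprint(training_rows),
    }


rows = [{"amount": 100.0, "label": 0}, {"amount": 9000.0, "label": 1}]
# Same code, parameters, and data produce an identical record.
record_a = make_model_record("fraud-nn", "abc123", {"lr": 0.01}, rows)
record_b = make_model_record("fraud-nn", "abc123", {"lr": 0.01}, rows)
# Changing the training data changes the fingerprint.
record_c = make_model_record("fraud-nn", "abc123", {"lr": 0.01}, rows[:1])
```

If the inference pipeline loads the record together with the model, anyone debugging a prediction can see which code and data produced it.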

The Importance of Automation in Retraining ML Models

Retraining ML models is a crucial part of maintaining their accuracy and relevance. However, manual retraining can be time-consuming and prone to errors. That's why automation plays a vital role in ensuring that the process is efficient, reliable, and scalable.

Automating retraining can help organizations set specific intervals for triggering retraining, ensuring that the models are updated regularly. This can also help save time and resources, as automation eliminates the need for manual intervention.

However, there can be challenges in automating retraining for complex models that require specialized hardware or software. In such cases, manual retraining may be necessary until an automated solution can be implemented.
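
The interval-based trigger described in this section can be sketched as a single decision function. The metric-degradation check is an added assumption that goes beyond the fixed interval mentioned in the text, and the function name, the 30-day default, and the 0.05 tolerance are all hypothetical.

```python
from datetime import datetime, timedelta


def should_retrain(last_trained, now, max_age=timedelta(days=30),
                   baseline_metric=None, current_metric=None, max_drop=0.05):
    """Return True when the model is too old or its metric has degraded."""
    # Interval trigger: retrain once the model exceeds its maximum age.
    if now - last_trained >= max_age:
        return True
    # Optional quality trigger: retrain early if the metric dropped too far.
    if baseline_metric is not None and current_metric is not None:
        return (baseline_metric - current_metric) > max_drop
    return False


now = datetime(2023, 6, 1)
stale = should_retrain(datetime(2023, 4, 1), now)   # 61 days old
fresh = should_retrain(datetime(2023, 5, 20), now)  # 12 days old
degraded = should_retrain(datetime(2023, 5, 20), now,
                          baseline_metric=0.90, current_metric=0.80)
```

A scheduler can call a function like this on a regular cadence and launch the training job only when it returns `True`.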

Simpl's foray into building in-house

Challenges of using SageMaker for Machine Learning projects

The use of SageMaker has been a game-changer for data science teams when it comes to handling large datasets for machine learning projects. However, the platform still presents some challenges that can impact the productivity of the team.  

  1. Resource allocation: When multiple people log on to SageMaker at the same time, loading a big file or model can crash the system for everyone. This affects not only the person who initiated the request but everyone else. This highlights the need for a system that can manage such issues on the team's end.
  2. Cost of running GPUs: GPU instances for neural network models, which are essential for processing large amounts of data, can be very expensive, and the team has to be cautious when using them. To save costs, they have set up a system that shuts down a notebook if it is idle for a certain period. However, they hope to move to a more automated system that scales up and down depending on usage.
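
The idle-shutdown idea in the second point can be sketched as a function that decides which notebooks have been idle for too long. The notebook names, timestamps, and one-hour limit are hypothetical, and the call that actually stops an instance is deliberately left out because it depends on the cloud provider's API.

```python
def notebooks_to_stop(last_activity, now, idle_limit_seconds=3600.0):
    """Return the names of notebooks idle for longer than the limit."""
    return sorted(
        name
        for name, last_seen in last_activity.items()
        if now - last_seen > idle_limit_seconds
    )


# Timestamps (in seconds) of the last recorded activity per notebook.
last_activity = {"nb-gpu-1": 1000.0, "nb-gpu-2": 4000.0, "nb-cpu-1": 100.0}
to_stop = notebooks_to_stop(last_activity, now=5000.0, idle_limit_seconds=3600.0)
```

Separating the decision from the shutdown action makes the policy easy to test before it is wired to real instances.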

Although SageMaker has been a useful platform for the team, there are still other options like Kubernetes that they have not tried yet. However, the decision to use SageMaker was mainly driven by the need for a faster system that could handle large amounts of data.

Plans to Build a Better Version of SageMaker

The company plans to build its own machine learning platform, an improved version of SageMaker. The project started as an R&D experiment and now benefits from a larger team capable of in-house development. Although their virtual system possessed some SageMaker features, it lacked distributed computing. Adding distributed computing to their current virtual machine through Py console integration will provide the required solution.

For user access control and data accessibility, the company has built various IAM roles and allocated a child account to the data team for cost management. However, further work is still required, particularly given the sensitive data they handle as a fintech firm and the regular audits by the RBI.

While they could use an external platform, the company has chosen to develop their version of SageMaker in-house. Their decision is strategic and not based on constraints related to data accessibility or cost. By having greater control over the platform, they can scale and grow more efficiently. The company has already used distributed computing in some systems via DAS.

As we're scaling and the team is getting bigger, if you can do it in-house, why not?
- Sheekha

Considerations for Real-time Systems and Data Science Models

  1. For real-time systems, strict SLAs must be adhered to, and load distribution can be non-uniform, with specific peak hours where workload can be high.
  2. When deploying a real-time system, it's essential to consider latency and load balancing.
  3. Data science models should be created for actual business impact, not for the sake of being "fancy."
  4. Metrics are used to measure the impact of a model, such as the amount of fraud it can stop and the number of good users it can impact.
  5. The risk team and CFO make a decision about what point they are comfortable with in terms of cost and business impact.
  6. Backend costs, such as the amount of DynamoDB writes and reads, must be considered and tied to the model's business metrics to ensure they align with the desired impact.
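
The trade-off in points 4 to 6 can be made concrete by choosing a decline threshold that minimizes total cost. This is a sketch under assumed numbers, not Simpl's method: the function `pick_threshold`, the cost weights, and the four sample transactions are all hypothetical.

```python
def pick_threshold(scored, thresholds, fraud_loss, good_user_cost, infra_cost_per_txn):
    """Pick the score threshold with the lowest total cost.

    scored: list of (fraud_score, is_fraud, amount) tuples.
    Returns a (threshold, total_cost) pair.
    """
    best = None
    for threshold in thresholds:
        # Backend cost (for example, reads and writes) is paid on every transaction.
        cost = infra_cost_per_txn * len(scored)
        for score, is_fraud, amount in scored:
            declined = score >= threshold
            if is_fraud and not declined:
                cost += fraud_loss * amount  # missed fraud
            elif not is_fraud and declined:
                cost += good_user_cost       # good user wrongly declined
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


scored = [
    (0.95, True, 1000.0),
    (0.70, True, 200.0),
    (0.60, False, 500.0),
    (0.10, False, 50.0),
]
best_threshold, best_cost = pick_threshold(
    scored, [0.5, 0.65, 0.9],
    fraud_loss=1.0, good_user_cost=150.0, infra_cost_per_txn=0.01,
)
```

With these sample numbers the middle threshold wins, because the lowest one declines a good user and the highest one lets a fraudulent transaction through.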

Making ML Deployment as Simple as Software: Improving Developer Productivity

Developing ML models has become easier with libraries like Scikit-learn, but the time to start a project and go live is still high, particularly for smaller companies without pipelines and MLOps systems. Setting up pipelines, cleaning data, validating tests, and deploying models can take two to three months. Moreover, finding bugs in a model is challenging as there is no standardization for the process. Therefore, companies need systems that make model development as seamless as software development to improve developer productivity. The system should allow for flexibility, easy integration, and building on top of the existing system. It should also have standardization for bug finding, monitoring data going in and out, and feedback loops.
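
Standardized monitoring of the data going in and out of a model can start as simply as a wrapper around the prediction function. The sketch below assumes an in-memory log and a stand-in model; the decorator name `monitored`, the model name, and the rule inside `predict` are hypothetical.

```python
import functools

prediction_log = []


def monitored(model_name):
    """Decorator that records every input and output of a predict function."""
    def decorator(predict_fn):
        @functools.wraps(predict_fn)
        def wrapper(features):
            output = predict_fn(features)
            # Copy the input so later mutation does not alter the log.
            prediction_log.append(
                {"model": model_name, "input": dict(features), "output": output}
            )
            return output
        return wrapper
    return decorator


@monitored("toy-fraud-rule")
def predict(features):
    # Stand-in for a real model: flag large amounts.
    return 1 if features["amount"] > 1000 else 0


results = [predict({"amount": 50}), predict({"amount": 5000})]
```

In a real system the log would go to durable storage, where it can feed dashboards and the feedback loops mentioned above.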

The Importance of Ingraining Engineering Principles in Data Science

In the realm of data science, there has been a growing emphasis on the need for data scientists to possess engineering skills to ensure the successful and efficient deployment of ML models.

  1. Data scientists need to possess engineering skills to ensure efficient deployment of ML models. Good coding practices should be ingrained in data scientists to identify bugs that may affect the SLA of the model.
  2. Data scientists' love for certain tools, like Pandas, may result in slower performance when deployed in real-time. Data scientists need to be aware of the most efficient tools and their usage to ensure the efficient deployment of ML models.

You would want our data scientists to deploy everything and even filters.
- Sheekha

Additional Thoughts from Sheekha

MLOps: Build vs Buy

  1. Customization: Extensive customization may require building from scratch instead of adopting a third-party ML platform.
  2. Data sensitivity: Strict user access control management is crucial for companies dealing with sensitive data and may require an in-house system that can be customized for specific security requirements.
  3. Cost-consciousness: Building an in-house MLOps system may be more cost-effective for smaller companies, but they may eventually invest in third-party platforms for better ROI as the market matures.

LLMs

Sheekha expressed her interest in large language models (LLMs) and the new developments surrounding them, but at present, they are not using them in their work. However, she acknowledged that they are exploring interesting use cases for LLMs, particularly in their chatbot integration.

I definitely foresee a lot of interesting use cases for LLMs
- Sheekha

Read our previous blogs in the TrueML Series

True ML Talks #4 - Machine Learning Platform @ Salesforce
In this blog, we dive deep into Salesforce’s ML Platform. Understand how it solved both Software and ML deployment & understand its architecture

Keep watching the TrueML YouTube series and reading the TrueML blog series.


TrueFoundry is an ML deployment PaaS over Kubernetes that speeds up developer workflows, giving developers full flexibility in testing and deploying models while ensuring full security and control for the infra team. Through our platform, we enable machine learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds, allowing them to save cost and release models to production faster, enabling real business value realisation.