True ML Talks #6 - Machine Learning Platform @ Nomad Health
We are back with another episode of True ML Talks. This time we dive deep into Nomad Health's ML platform with Liming Zhao.
Liming Zhao is the CTO at Nomad Health, a technology company revolutionizing the healthcare staffing industry. With clinicians in short supply, particularly in the face of a pandemic, Nomad Health provides a marketplace where healthcare professionals can find attractive temporary assignments that meet the most urgent patient care needs.
- ML use cases in Nomad Health
- Machine Learning Team at Nomad Health
- Deploying Machine Learning Models
- Building a custom Feature Store solution
- Choosing MLOps Tools
- Managing Cloud Costs
Watch the full episode below:
ML Use Cases @ Nomad Health
- Predictive Modeling: Nomad Health has incorporated AI and ML into their operations, specifically in the area of predictive modeling. This helps prioritize work, given that clinicians are the scarcest resource. Nomad Health invests heavily in this area and treats it as a core machine learning capability.
- Recommendation Systems: Nomad Health uses graph-based modeling to recommend attractive jobs to clinicians. They incorporate this into their ranking and outreach emails, ensuring that clinicians are presented with the most suitable jobs even if they do not have the time to explore every available job (a toy sketch of one graph-based approach follows this list).
- Large Language Models: Nomad Health uses LLMs, such as GPT-3, to extract and augment job descriptions. They use the model to standardize job requirements, extracting meaningful information from a blob of text that may be written with varying degrees of detail and clarity. Nomad Health is also exploring using LLMs on resumes, but this is a challenging area due to the varying degrees of completeness in clinical staffing.
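To make the graph-based recommendation idea above concrete, here is a toy sketch using personalized PageRank over a clinician-job interaction graph. This illustrates one common graph approach, not Nomad Health's actual model; the node names and the choice of networkx are assumptions.

```python
# Toy graph-based job recommendation via personalized PageRank.
# Illustrative only; not Nomad Health's production approach.
import networkx as nx

# Bipartite graph: edges connect clinicians to jobs they viewed or applied to.
G = nx.Graph()
G.add_edges_from([
    ("clinician_1", "job_a"), ("clinician_1", "job_b"),
    ("clinician_2", "job_b"), ("clinician_2", "job_c"),
])

def recommend_jobs(graph, clinician, top_k=2):
    # Rank nodes by personalized PageRank seeded at the clinician,
    # then keep only job nodes the clinician has not interacted with yet.
    scores = nx.pagerank(graph, personalization={clinician: 1.0})
    seen = set(graph.neighbors(clinician))
    candidates = [
        (node, score) for node, score in scores.items()
        if node.startswith("job_") and node not in seen
    ]
    return sorted(candidates, key=lambda x: x[1], reverse=True)[:top_k]

print(recommend_jobs(G, "clinician_1"))  # e.g. [("job_c", <score>)]
```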
Nomad Health is exploring the use of large language models (LLMs) for job descriptions and resumes. The company has seen more success with robust models like GPT-3. However, using LLMs on resumes for clinical staffing presents challenges due to the need for specific certifications and licenses. Nomad Health is working towards a comprehensive digital resume and credential set on their platform to simplify the process for both clinicians and medical facilities.
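As a rough illustration of how an LLM can standardize a free-text job posting, the sketch below prompts a chat model to return structured requirements as JSON. It assumes the OpenAI Python client; the prompt, model name, and field names are illustrative and not Nomad Health's actual pipeline.

```python
# Minimal sketch: extract structured job requirements from a free-text posting.
# Assumes the OpenAI Python client (>=1.0) and OPENAI_API_KEY in the environment.
import json

from openai import OpenAI

client = OpenAI()

job_description = """ICU RN needed ASAP, nights, 13 weeks, BLS/ACLS required,
2+ yrs ICU exp, EPIC charting a plus."""

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model choice
    messages=[
        {"role": "system",
         "content": "Extract job requirements as JSON with keys: "
                    "specialty, shift, duration_weeks, certifications, min_experience_years. "
                    "Return only JSON."},
        {"role": "user", "content": job_description},
    ],
)

# In practice the response should be validated; a sketch assumes clean JSON.
requirements = json.loads(response.choices[0].message.content)
print(requirements)
```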
Machine Learning Team at Nomad Health
Nomad Health's data science team is relatively small, consisting of nine members, including a manager, a data scientist, a data analyst, and a data engineer. The remaining five members are machine learning engineers, two of whom focus on infrastructure development and MLOps, while the other three focus on building, testing, and shipping models.
They leverage readily available solutions from other industries and reference problems, adapting them to their specific use cases, and invest heavily in collecting, parsing, and standardizing data. Nomad's team structure and collaboration practices enable them to move quickly and efficiently, with all members working together to solve problems. Thanks to their data-driven approach and a talented, diverse team, they have achieved significant success in MLOps, learning from their own needs and bottlenecks along the way.
Deploying Machine Learning Models
Nomad Health initially invested heavily in Vertex AI, since the majority of its technology infrastructure was in Google Cloud Platform (GCP). As the team faced more complicated business needs and a higher deployment frequency, they started to move production service endpoints out of Vertex AI and deploy them to a Google Kubernetes Engine (GKE) cluster. This gave the team more flexibility, control, and scalability over their deployment and CI/CD pipeline.
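As a rough idea of what a GKE-hosted serving path can look like, here is a minimal FastAPI endpoint that could be containerized and deployed to a cluster. The framework choice, model artifact name, and request schema are assumptions for illustration, not Nomad Health's actual service.

```python
# Minimal model-serving sketch that could be containerized and deployed to GKE.
# "model.pkl" is a placeholder artifact assumed to be baked into the image.
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

with open("model.pkl", "rb") as f:  # placeholder: scikit-learn-style model
    model = pickle.load(f)

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    # Score a single feature vector and return a plain JSON response.
    score = model.predict([req.features])[0]
    return {"score": float(score)}
```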
Nomad Health's machine learning team uses Vertex AI for model training, leveraging Vertex AI's rich set of libraries, interfaces, and tools to quickly try things, monitor success, and understand promising signals. The team is also evaluating MLflow, but currently does not use Databricks in its stack.
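For context, launching a training job on Vertex AI from the Python SDK looks roughly like the sketch below. The project, bucket, training script, and container image are placeholders, not Nomad Health's configuration.

```python
# Sketch: submitting a custom training job to Vertex AI with the Python SDK.
# Project, bucket, script, and container URI are illustrative placeholders.
from google.cloud import aiplatform

aiplatform.init(
    project="my-gcp-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

job = aiplatform.CustomTrainingJob(
    display_name="job-match-training",
    script_path="train.py",  # local training script uploaded by the SDK
    container_uri="us-docker.pkg.dev/vertex-ai/training/sklearn-cpu.1-0:latest",
    requirements=["pandas"],
)

job.run(
    args=["--epochs", "10"],
    replica_count=1,
    machine_type="n1-standard-4",
)
```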
Overall, the evolution of Nomad Health's deployment strategy stemmed from the practical complexity the team encountered and from an adjacent engineering team's successful deployment on GKE. This strategy allowed the machine learning team to leverage existing practices and infrastructure while still retaining control over their deployments.
Building a Custom Feature Store Solution
Nomad Health has created a custom feature engineering solution to handle its large dataset and to build a more consistent feature store. The company realized that its different machine learning projects needed to share the same set of data, including information about clinician job views, application outcomes, and credentials or preferences. They formalized a feature store and created a team responsible for taking raw data, doing basic transformation, and then landing the data in a business-aligned way. The BI organization could then use the transformed data to pivot quickly to visualization, and the data science team could quickly extract a subset of features from the feature store.
Nomad Health uses the open-source solution Feast to extract and store features for different models, and feedback from modeling flows back into the feature store. The company leverages Vertex AI for modeling and has a separate pipeline for deployment. One of the most innovative pieces of the overall ML platform is the transformation of raw data into consistent entities, events, and dimensions that the BI team and the data science team can use for data analysis and predictive analysis, respectively. This transformation has allowed Nomad Health to create a reliable signal that correlates strongly with applications and with offers being rendered by facilities.
Initially, we started with the Vertex AI infrastructure and eventually moved into our own open-source feature store implementation. Getting our proprietary data, the unique shape and set of data, is actually the key.
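To illustrate what defining shared features in Feast can look like, here is a minimal feature-view sketch. The entity, field names, and parquet source are illustrative assumptions, and the exact API varies by Feast version.

```python
# Minimal Feast sketch: one entity and one feature view over a parquet export.
# Entity, fields, and file path are illustrative, not Nomad Health's schema.
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

clinician = Entity(name="clinician", join_keys=["clinician_id"])

engagement_source = FileSource(
    path="data/clinician_engagement.parquet",
    timestamp_field="event_timestamp",
)

clinician_engagement = FeatureView(
    name="clinician_engagement",
    entities=[clinician],
    ttl=timedelta(days=7),
    schema=[
        Field(name="job_views_7d", dtype=Int64),
        Field(name="application_rate_30d", dtype=Float32),
    ],
    source=engagement_source,
)
```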
Choosing MLOps Tools
Early on, businesses should invest in a tool or platform that provides most of what they need, such as Vertex AI or SageMaker, so they can focus on realizing business value first. Once they have built a robust engineering or data science team, they can take production deployment out of the platform and add adjacent services. The journey a company goes through matters more than any particular set of recommended tools; it is always best to pick something that works reasonably well for now and iterate from there.
If you only have two people starting out on your data science team, and the first thing you do is set up all your proprietary infrastructure. For what? What have you proven this fancy engine and super powerful infrastructure could actually yield?
Managing Cloud Costs
- Invest in monitoring and alerting tools: Consider using tools such as TrueFoundry to monitor infrastructure performance and identify instances causing cost fluctuations. These tools can help to detect issues early on and take corrective action quickly.
- Rely on manual practices: Use manual practices, such as monitoring logs and signals piped into a Colab notebook, to identify instances causing cost fluctuations (see the sketch after this list). Investigate these instances on a weekly or bi-weekly basis and restart or terminate certain training models as necessary.
- Set a budget and receive real-time reports: Set a budget for cloud costs and receive real-time reports from the cloud provider to ensure that the budget is not exceeded. This can help to keep costs under control and prevent unexpected expenses.
- Implement more sophisticated solutions: As the infrastructure grows, consider implementing more sophisticated solutions to manage costs effectively. This may include using automated tools or hiring dedicated personnel to manage cloud costs.
- Strike a balance between cost and performance: It is essential to strike a balance between cost and performance to achieve the desired outcomes. Consider optimizing machine learning workloads to ensure that they are cost-effective while still meeting performance requirements.
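As a rough sketch of the kind of manual cost check described above, the snippet below flags days whose spend jumps well above a rolling baseline in a billing export. The file path, column names, and thresholds are assumptions, not Nomad Health's setup.

```python
# Rough cost-spike check over a daily billing export (illustrative columns).
import pandas as pd

billing = pd.read_csv("billing_export.csv", parse_dates=["usage_date"])
daily = billing.groupby("usage_date")["cost"].sum().sort_index()

# Rolling two-week baseline and spread of daily spend.
baseline = daily.rolling(window=14, min_periods=7).mean()
spread = daily.rolling(window=14, min_periods=7).std()

# Flag days more than 3 standard deviations above the recent baseline.
spikes = daily[daily > baseline + 3 * spread]
print(spikes)
```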
Additional Thoughts from Liming Zhao
MLOps: Build vs Buy
- The decision between managed services and in-house infrastructure is critical for any MLOps implementation; a hybrid approach is recommended as the organization matures.
- When evaluating cost and resourcing, consider long-term outcomes and the tradeoffs between costs. Monitor costs carefully, and move high-performance components to proprietary infrastructure for stable products.
- For less reliable models, tolerate some cost fluctuation, but use tags for cost attribution and monitor price fluctuations for effective cost optimization.
Importance of Adapting to Changing Business Needs
During the pandemic, Nomad Health had to prioritize the offers most likely to be rendered in order to manage the influx of job applications. However, as people became more hesitant to apply for jobs, the recommendation engine had to be adjusted to show candidates more options.
In retrospect, Nomad Health's initial focus on speed and autonomy was the right decision for a small team with uncertain business needs. However, as the team and business needs evolved, the company had to shift its focus to accuracy and efficiency.
This journey highlights the importance of considering changing business situations when making machine learning decisions. By being agile and willing to adapt, businesses can make informed decisions that enable them to evolve with the changing business landscape.
If you enjoyed this blog, here are the previous blogs in the TrueML series.
Keep watching the TrueML YouTube series and reading the TrueML blog series.
TrueFoundry is an ML deployment PaaS over Kubernetes that speeds up developer workflows while allowing full flexibility in testing and deploying models and ensuring full security and control for the infra team. Through our platform, we enable machine learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds, allowing them to save costs and release models to production faster, enabling real business value realization.