Some examples:
- instrumenting observability that not only monitors data quality and upstream ETL job status, but also domain-specific considerations of training ML models, like overfitting, confusion matrices, business use-case accuracy or validation checks, ROC curves, and more (all needing to be customized and centrally reported for each model training task; a small sketch of that kind of per-task reporting follows this list).
- standardizing end-to-end tooling for special resources, e.g. queueing and batching to keep utilization high for production GPU systems, high-RAM use cases like approximate nearest neighbor indexes, and just run-of-the-mill stuff like how to take a trained model and deploy it behind a microservice in a way that bakes in logging, tracing, alerting, and more.
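To make the first bullet concrete, here is a minimal sketch of what "centrally reported per training task" could look like: compute the domain-specific metrics (confusion matrix, ROC AUC, a validation check) and post them to an internal reporting endpoint. This is not any particular platform's API; the endpoint URL, payload shape, and validation threshold are assumptions for illustration.

```python
import json
import urllib.request

import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Placeholder endpoint for a central metrics store; purely illustrative.
REPORTING_URL = "https://metrics.example.internal/report"

def report_training_task(task_id: str, y_true: np.ndarray, y_score: np.ndarray,
                         threshold: float = 0.5) -> None:
    """Compute model-specific metrics for one training task and report them centrally."""
    y_pred = (y_score >= threshold).astype(int)
    auc = float(roc_auc_score(y_true, y_score))
    payload = {
        "task_id": task_id,
        "confusion_matrix": confusion_matrix(y_true, y_pred).tolist(),
        "roc_auc": auc,
        # Example of a domain-specific validation check: minimum acceptable AUC.
        "passes_validation": auc >= 0.8,
    }
    req = urllib.request.Request(
        REPORTING_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # fire-and-forget is enough for the sketch

# report_training_task("churn-model-2020-07", y_true, y_score)
```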
Machine learning engineers and data scientists tend to have a comparative advantage when they can focus on understanding the data, running experiments to decide which models are best, pairing with product managers or engineers to understand constraints around the user experience, and designing software tools and abstractions around unique training or serving architectures (like the GPU queuing example).
Increasingly, teams of data scientists are required to do devops work: configuring and maintaining e.g. Kubernetes and CI/CD workloads, alerting and monitoring, logging, and instrumenting security or data access control compliance solutions.
This is harmful because it reduces the time and effort these engineers can spend on their comparative advantages - a direct loss to the customer or user - in exchange for devops jobs they are not trained for, are not interested in (which often leads data scientists to burnout), and that many non-specialists could do instead.
How do you structure teams, build tools and establish compliance or operations expectations that allow data scientists and related statistical scientists and ML backend engineers to flourish?
My team manages a platform/framework we built similar to FloydHub (we wrote a Django app that integrates with AWS, but any cloud provider would do) and another similar to SigOpt (we built a server-client system that utilizes the first system's APIs to deploy nodes)[1]. This lets us effectively develop and then hyperparameter-tune our models. Finally, we deploy them within a library and a Flask app (a rough sketch of that serving pattern is below). This makes them easily digestible across the enterprise.
We are a team of three that leverages other teams to make improvements (aka "inner sourcing" development). It's a busy job, but managing the whole stack gives us the ability to develop models much faster and more effectively. With multiple teams utilizing our frameworks, we have a kind of critical mass that keeps everything running quite well.
[1] https://medium.com/capital-one-tech/system-language-agnostic...
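As a rough illustration of that last step (and not the actual framework described in [1]), putting a trained model behind a Flask app with logging baked in can look roughly like this; the model path, endpoint, and payload are assumptions.

```python
import logging
import pickle

from flask import Flask, jsonify, request

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model-service")

app = Flask(__name__)

# Assumed artifact produced by the training/tuning pipeline.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    prediction = model.predict([features]).tolist()
    logger.info("served prediction for %d features", len(features))
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```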
A colleague of mine attempted to share a Dropbox link to a git repo and working directory he had helpfully zipped up, including 4 datasets.
So in order to get the 50 lines of code I was meant to merge in, he thought it was reasonable to have me download a 4 GB archive.
I told him no.
1) Researchers/mathematicians
2) DevOps/software engineers that are fluent enough in ML/AI processes and methodologies that they can listen to what group 1 says they need and implement a system that will efficiently solve their problem.
I work on data ingestion and getting it to a safe place as quickly as possible (this team has more senior members than other teams because of its importance). Another team does ETL on it, another handles data storage and access, another ML governance, another creates the models, and so on.
We focus on our domain and communicate. Seems to work well.
1100% agree with you about unnecessary time spent on configuration and maintenance.
As a research-oriented professional, you need something that will seamlessly integrate with _your own_ flow.
We are in the ML stone-age, the playbook is not really written yet. Currently, CI/CD + agile is (necessary?) overhead that costs us precious time-to-product.
Here is my manifesto:
1. Anything related to "production" should be taken care of by DevOps peeps, yes, even if it is "MLOps". Monitoring, standardization, etc. should not be your responsibility. If it is somehow on you, then it should be part of the same experimentation platform you are using. Extra tools? Extra people.
2. Likewise, anything related to data-engineering, preparation etc. should be compartmentalized and have separate version control (it is not as complicated as doing it the DVC way, BTW). If you do have to do these tasks - you guessed it - it should be part of the same experimentation platform you are using.
3. Research MLOps (ResOps?): Did I say experimentation platform? Any team member should be able to work as she wishes - notebooks, scripts, whatnot. And if you forget to commit something before you run? You want to know about all the changes. Sharing? Comparing? - must-haves. Reproducible experimentation? You need to be able to automatically track environment variables, installed packages, etc. (see the sketch after this list). Most importantly, you need to be able to offload to the cluster in the same running environment with a button click. I am not going to spend hours deciphering logs to find out that the wrong version of a package was installed in our container. I am not going to spend days sorting out containers to find "the one that works".
4. Lastly - IT work ("devops") on cluster management: monitoring your GPU usage per task, scheduling experiments, early stopping with a button click, an on-prem managed platform - WHY IS THIS OUR JOB? Well, it isn't. But if it is, it should be integrated with your platform, and day-to-day operations should be "automagical"; cluster config should be done once, by professionals (even outsourced help).
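For point 3, a bare-bones sketch of the kind of automatic tracking meant there: snapshot the installed packages, the git commit, and selected environment variables alongside each run. This is a generic illustration, not how any particular experiment platform does it; the file name and captured fields are assumptions.

```python
import json
import os
import subprocess
from datetime import datetime

def snapshot_run_environment(out_dir: str = ".") -> str:
    """Write a JSON snapshot of the run environment next to the experiment outputs."""
    snapshot = {
        "timestamp": datetime.utcnow().isoformat(),
        # Installed packages, so the exact versions can be reconstructed later.
        "packages": subprocess.run(
            ["pip", "freeze"], capture_output=True, text=True
        ).stdout.splitlines(),
        # The code version the run was launched from.
        "git_commit": subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip(),
        # Only whitelisted env vars; dumping everything risks leaking secrets.
        "env": {k: v for k, v in os.environ.items() if k.startswith("CUDA")},
    }
    path = os.path.join(out_dir, "run_snapshot.json")
    with open(path, "w") as f:
        json.dump(snapshot, f, indent=2)
    return path
```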
If you feel me here, then know that you are REALLY not alone. We took to heart what our clients & friends told us, and we launched Allegro Trains as a solution for all of this. Magically simple, and FREE.
Sorry for caps, I tend to be emotional on this ;) Hit me up on twitter @LSTMeow
We decided to build tooling around that workflow. We shipped and were paid with that workflow, so we wanted to make it efficient and effective. Other solutions didn't fit our needs. Straight out of https://xkcd.com/927/, except it needs to address our use cases. We backtest with our past projects and use it for our current ones to scale our consulting capabilities.
>- standardizing end-to-end tooling for special resources, e.g. queueing and batching to keep utilization high for production GPU systems, high-RAM use cases like approximate nearest neighbor indexes, and just run-of-the-mill stuff like how to take a trained model and deploy it behind a microservice in a way that bakes in logging, tracing, alerting, and more.
We do schedule notebooks[1]. We also publish AppBooks[2], which are automatically parametrized notebooks: the platform automatically generates a form so anyone can set the variables the notebook's author chose to expose and run the notebook without changing the code. Extremely useful when you want to have a domain expert tweak a domain-specific variable without them having to know what a notebook is. In some projects, there's someone with deep, deep expertise in a field for whom a variable is really important, but that variable gets dismissed by the ML practitioner because they didn't see a correlation or an impact on AUC or something, so the domain expert has an input, whether on relevant variables or on the real-world metrics we're working toward. We also added instrumentation for the basics (CPU/RAM/GPU, data, servers running, etc.), again so that our teammates don't bother with this. We use different Docker images for notebook servers, some of which are 30GB, so members don't bother with dependencies or GPU/TensorFlow/CUDA version conflicts.
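For readers unfamiliar with parametrized notebooks, an analogous (and much more minimal) way to do it is with papermill; this is not how AppBooks are implemented, and the notebook paths and parameter names below are made up.

```python
import papermill as pm

# Run a template notebook with externally supplied values for the variables
# its author chose to expose; the executed copy (with outputs) is saved separately.
pm.execute_notebook(
    "forecasting_template.ipynb",
    "forecasting_emea_q3.ipynb",
    parameters={"horizon_days": 90, "region": "EMEA"},
)
```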
We automatically track metrics and parameters without the notebook's author writing boilerplate code, because they forget. The models are saved, and the ML practitioner can click a button to deploy a model[3]. We stressed self-service a lot because its absence put a lot of stress on us: an ML practitioner who wants to deploy a model asks someone in the team who's probably busy. So we said: anyone should be able to click and deploy. This is also useful because a developer downstream only needs to send HTTP requests to interact with the model. We used to have application developers who also needed to know more than they should have about the internals/dependencies of the models. Not anymore.
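That downstream interaction is just a plain HTTP call, along these lines (the endpoint URL and JSON contract here are made up for illustration):

```python
import requests

# A downstream developer needs nothing beyond an HTTP call; no knowledge of
# the model's internals or dependencies is required.
response = requests.post(
    "https://models.example.internal/churn/predict",  # placeholder endpoint
    json={"features": {"tenure_months": 14, "monthly_charges": 42.5}},
    timeout=5,
)
response.raise_for_status()
print(response.json())  # e.g. {"prediction": 0.73}
```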
We also added near-real-time collaboration/editing so many people can work on the same notebook, which is especially useful when a team member is struggling to implement something and others chime in to help debug, refactor, and review[4]. Everyone sees everyone's cursor for better awareness of what's being done. A use case is an ML practitioner playing with a paper who's struggling on the algorithms or data structures part of it. They can solicit another team member, who'll chime in and help. One of the features that's useful is multiple checkpoints[5], which lets one revert to an arbitrary checkpoint, not just the last one. This, again, is porcelain, because an ML practitioner playing with git is a context switch, and they don't really like it.
So, we've done a few things to make our life easier. The workflow isn't perfect, and the tools aren't perfect, but we're removing friction.
We add applications for the usual use cases (time-series forecasting, sentiment analysis, anomaly detection, churn) to leverage projects we did in the past.
[0]: https://iko.ai
[1]: https://iko.ai/docs/notebook/#long-running-notebooks
[2]: https://iko.ai/docs/appbook/
[3]: https://iko.ai/docs/appbook/#deploying-a-model