HACKER Q&A
📣 mlthoughts2018

How does your data science or machine learning team handle DevOps?


Machine learning teams often face operational needs not seen in many other domains.

Some examples:

- instrumenting observability that not only monitors data quality and upstream ETL job status, but also domain-specific considerations of training ML models, like overfitting, confusion matrices, business use case accuracy or validation checks, ROC curves and more (all needing to be customized and centrally reported for each model training task); a sketch of such a metrics hook appears after this list.

- standardizing end-to-end tooling for special resources, e.g. queueing and batching to keep utilization high for production GPU systems, high-RAM use cases like approximate nearest neighbor indexes, and just run-of-the-mill stuff like how to take a trained model and deploy it behind a microservice in a way that bakes in logging, tracing, alerting, and more; a batching sketch also appears after this list.
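
As a minimal sketch of the per-model evaluation hook the first bullet describes, assuming a scikit-learn-style classifier; `evaluate_and_report` and `report_to_dashboard` are illustrative names, and the central metrics backend is left abstract:

```python
# Hypothetical per-model evaluation hook: compute domain-specific metrics
# (ROC AUC, confusion matrix, an overfitting signal) and push them to a
# central reporting backend supplied by the caller.
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate_and_report(model, X_train, y_train, X_val, y_val, model_name, report_to_dashboard):
    val_probs = model.predict_proba(X_val)[:, 1]
    val_preds = (val_probs >= 0.5).astype(int)
    metrics = {
        "roc_auc": roc_auc_score(y_val, val_probs),
        "confusion_matrix": confusion_matrix(y_val, val_preds).tolist(),
        # Train/validation accuracy gap as a crude overfitting check.
        "train_val_gap": model.score(X_train, y_train) - model.score(X_val, y_val),
    }
    report_to_dashboard(model_name, metrics)  # e.g. POST to an internal metrics API
    return metrics
```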
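
And a sketch of the queueing/batching idea from the second bullet: requests accumulate in a queue and are flushed to the model when a batch fills or a short timeout expires, which is one common way to keep GPU utilization high. `predict_batch` is an assumed method, not something named in the post:

```python
# Hypothetical micro-batching wrapper for a GPU-backed model.
import queue
import threading
import time

class MicroBatcher:
    def __init__(self, model, max_batch=32, max_wait_s=0.01):
        self.model = model
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def predict(self, features):
        # Called from request-handling threads; blocks until the batch runs.
        slot = {"input": features, "done": threading.Event(), "output": None}
        self.requests.put(slot)
        slot["done"].wait()
        return slot["output"]

    def _loop(self):
        while True:
            batch = [self.requests.get()]  # block until at least one request arrives
            deadline = time.monotonic() + self.max_wait_s
            while len(batch) < self.max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break
            outputs = self.model.predict_batch([s["input"] for s in batch])
            for slot, output in zip(batch, outputs):
                slot["output"] = output
                slot["done"].set()
```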

Machine learning engineers and data scientists tend to have a comparative advantage when they can focus on understanding the data, running experiments to decide which models are best, pairing with product managers or engineers to understand constraints around the user experience, and designing software tools and abstractions around unique training or serving architectures (like the GPU queuing example).

Increasingly, teams of data scientists are required to do DevOps work: configuring and maintaining e.g. Kubernetes and CI/CD workloads, alerting and monitoring, logging, and instrumenting security or data access control compliance solutions.

This is harmful because it reduces the time and effort these engineers can spend on their comparative advantages (a direct loss to the customer or user), in exchange for DevOps jobs they are not trained to do, are not interested in (which often leads data scientists to burnout), and that many non-specialists could do.

How do you structure teams, build tools and establish compliance or operations expectations that allow data scientists and related statistical scientists and ML backend engineers to flourish?


  👤 viig99 Accepted Answer ✓
ML engineer here. Our team started as a research team; now that we have things in production and a lot of DevOps and engineering work, we bifurcated into pods that each own specific pieces, though there's a lot of constant fire-fighting. We rewrote the entire stack from Python to a C++ threadpool async gRPC server (is Thrift the only good threadpool server implementation available?), deployed on OpenShift, and use Vector + InfluxDB + Grafana for dashboards and internal model monitoring, Elasticsearch for logging, and a lot of other tools for validation, filtering for potential training candidates, etc. Right now we're working on CI/CD for ML: if training produces a better model based on different validation sets, a one-click deployment is ready for approval.

👤 lettergram
Depends on what you mean by machine learning. In deep learning applications, optimization and getting the thing to train effectively are 90% of the job.

My team manages a platform/framework we built similar to FloydHub (we wrote a Django app that integrates with AWS, but any cloud provider would do) and another similar to SigOpt (we built a server-client system that utilizes the first system's APIs to deploy nodes)[1]. This lets us effectively develop and then hyperparameter-tune our models. Finally, we deploy them within a library and a Flask app. This makes them easily digestible across the enterprise.
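
For readers unfamiliar with the pattern, a minimal sketch of serving a trained model behind a Flask app; the file name, route, and payload shape are illustrative assumptions, not lettergram's actual code:

```python
# Minimal model-serving Flask app (illustrative, not the commenter's code).
from flask import Flask, jsonify, request
import joblib

app = Flask(__name__)
model = joblib.load("model.joblib")  # assumed scikit-learn-style model artifact

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]        # e.g. [1.2, 3.4, 5.6]
    prediction = model.predict([features]).tolist()  # wrap as a single-row batch
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```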

We are a team of three that leverages other teams to make improvements (aka "inner sourcing" development). It's a busy job, but managing the whole stack gives us the ability to develop models much faster and more effectively. With multiple teams utilizing our frameworks, we have a kind of critical mass that keeps everything running quite well.

[1] https://medium.com/capital-one-tech/system-language-agnostic...


👤 hprotagonist
Would you accept "very badly"?

A colleague of mine attempted to share a Dropbox link to a git repo and working directory he had helpfully zipped up, including 4 datasets.

So, in order to get the 50 lines of code I was meant to merge in, he thought it was reasonable to have me download a headless 4 GB archive.

I told him no.


👤 navbaker
We have been fortunate enough to be able to hire people for our group of about 60 that fall into two categories:

1) Researchers/mathematicians

2) DevOps/software engineers that are fluent enough in ML/AI processes and methodologies that they can listen to what group 1 says they need and implement a system that will efficiently solve their problem.


👤 was_boring
I work on part of this problem on my team at my job. We are organized so that specific teams do exactly what you describe.

I work on data ingestion and getting it to a safe place as quickly as possible (this team has more senior members than the others because of its importance). Another team does ETL on it, another handles data storage and access, another ML governance, another creates the models, and so on.

We focus on our domain and communicate. Seems to work well.


👤 firstfewshells
All the companies I've worked at have a dedicated Platform team that does all of the things that you mentioned.

👤 atmosx
They have a dedicated devops team who handles infrastructure and operations on top of AWS.

👤 msapaydin
Some projects started by researchers from Stanford University seem to address these issues. Some keywords I have come across are MLflow, Sisu, and Databricks; the last is the company behind Apache Spark. Sisu is a company I did not try, and I had trouble working with MLflow, but the ideas are worth a look.

👤 theo31
I couldn't agree more. I started building an ML hosting platform to solve your second point. I'm thinking of building a managed NN service, as it is a common pain point.

https://inferrd.com


👤 LSTMeow
Disclosure/plug: Evangelist for Allegro AI here, but I'm only going to allude to our FREE open-source platform + DevOps solution, Allegro Trains: https://github.com/allegroai/trains

*

1100% agree with you about unnecessary time spent on configuration and maintenance.

As a research-oriented professional, you need something that will seamlessly integrate with _your own_ flow.

We are in the ML stone age; the playbook is not really written yet. Currently, CI/CD + agile is (necessary?) overhead that costs us precious time-to-product.

Here is my manifesto:

1. Anything related to "production" should be taken care of by DevOps peeps, yes even if it is "MLOps". Monitoring, standardization etc should not be your responsibility. If it is somehow on you, then it should be part of the same experimentation platform you are using. Extra tools? Extra people.

2. Likewise, anything related to data-engineering, preparation etc. should be compartmentalized and have separate version control (it is not as complicated as doing it the DVC way, BTW). If you do have to do these tasks - you guessed it - it should be part of the same experimentation platform you are using.

3. Research MLOps (ResOps?): Did I say experimentation platform? Any team member should be able to work as she wishes - notebooks, scripts, whatnot. And if you forget to commit something before you run? You want to know about all the changes. Sharing? Comparing? - must have. Reproducible experimentation? You need to be able to automatically track environment variables, installed packages, etc. (a sketch of that kind of environment capture follows this list). Most importantly, you need to be able to offload to the cluster in the same running environment with a button click. I am not going to spend hours deciphering logs to find out that the wrong version of a package was installed in our container. I am not going to spend days sorting out containers to find "the one that works".

4. Lastly - IT work ("devops") on cluster management: monitoring your GPU usage per task, scheduling experiments, early stopping with a button click, an on-prem managed platform - WHY IS THIS OUR JOB? Well, it isn't. But if it is, it should be integrated with your platform, day-to-day operations should be "automagical", and cluster config should be done once, by professionals (even outsourced help).
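
A minimal, platform-agnostic sketch of the automatic environment capture point 3 asks for (my illustration, not a statement about how any particular platform does it); the file name and environment-variable prefixes are arbitrary choices:

```python
# Hypothetical sketch: snapshot the Python version, installed packages, and
# selected environment variables next to an experiment run so it can be
# reproduced later.
import json
import os
import sys
import importlib.metadata as importlib_metadata

def snapshot_environment(path="run_environment.json", env_prefixes=("CUDA", "PYTHON")):
    snapshot = {
        "python": sys.version,
        "packages": {dist.metadata["Name"]: dist.version
                     for dist in importlib_metadata.distributions()},
        "env": {key: value for key, value in os.environ.items()
                if key.startswith(env_prefixes)},
    }
    with open(path, "w") as f:
        json.dump(snapshot, f, indent=2)
    return snapshot
```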

If you feel me here, then know that you are REALLY not alone. We took to heart what our clients & friends told us, and we launched Allegro Trains as a solution for all of this. Magically simple, and FREE.

Sorry for caps, I tend to be emotional on this ;) Hit me up on twitter @LSTMeow


👤 Jugurtha
Like anyone working in that space who wants to keep their sanity, we're building our own machine learning platform[0]. We shipped many projects, and it can be taxing to work on different stacks, especially since we build for enterprise clients and they always want complete solutions. The model is but a foot in the door; you must also do custom front-end/back-end/model/pipeline/data acquisition work.

We decided to build tooling around that workflow. We shipped and were paid with that workflow, so we wanted to make it efficient and effective. Other solutions didn't fit our needs. Straight out of https://xkcd.com/927/, except it needs to address our use cases. We backtest with our past projects and use it for our current ones to scale our consulting capabilities.

>- standardizing end-to-end tooling for special resources, e.g. queueing and batching to keep utilization high for production GPU systems, high-RAM use cases like approximate nearest neighbor indexes, and just run-of-the-mill stuff like how to take a trained model and deploy it behind a microservice in a way that bakes in logging, tracing, alerting, and more.

We do schedule notebooks[1]. We also publish AppBooks[2], which are automatically parametrized notebooks: the platform automatically generates a form so anyone can set the variables the notebook's author chose to expose and run a notebook without changing the code. This is extremely useful when you want a domain expert to tweak a domain-specific variable without them having to know what a notebook is. In some projects, there's someone with deep, deep expertise in a field for whom a variable is really important, but that variable gets dismissed by the ML practitioner because they didn't see a correlation or an impact on AUC or something; this way the domain expert has an input, whether on relevant variables or on the real-world metrics we're working toward.

We also added instrumentation for the basics: CPU/RAM/GPU, data, servers running, etc., again so that our teammates don't have to bother with this. We use different Docker images for notebook servers, some as large as 30GB, so members don't have to bother with dependencies, GPU/TensorFlow/CUDA and version conflicts.
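
The parametrized-notebook idea is similar in spirit to what papermill does in the open-source world; this is only an analogy for readers, not a statement about how iko.ai is implemented, and the notebook paths and parameter names are made up:

```python
# Run a notebook with injected parameters (papermill-style parametrization).
import papermill as pm

pm.execute_notebook(
    "train_model.ipynb",          # source notebook with a tagged "parameters" cell
    "train_model_output.ipynb",   # executed copy with outputs and injected values
    parameters={"learning_rate": 0.01, "epochs": 20},
)
```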

We automatically track metrics and parameters without the notebook's author writing boilerplate code, because they forget. The models are saved, and the ML practitioner can click a button to deploy a model[3]. We stressed self-service a lot because its absence put a lot of stress on us: an ML practitioner wants to deploy a model and asks someone on the team who's probably busy. So we said: anyone should be able to click and deploy. This is also useful because a developer downstream only needs to send HTTP requests to interact with the model. We used to have application developers who also needed to know more than they should have about the internals/dependencies of the models. Not anymore.
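
For the downstream developer, that interaction can be as simple as the request below; the endpoint URL and payload shape are made-up placeholders, not iko.ai's actual API:

```python
# Illustrative only: calling a deployed model over HTTP.
import requests

response = requests.post(
    "https://models.example.com/churn-model/predict",  # hypothetical model endpoint
    json={"features": {"tenure_months": 14, "monthly_spend": 42.0}},
    timeout=10,
)
response.raise_for_status()
print(response.json())
```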

We also added near-real-time collaboration/editing so many people can work on the same notebook, which is especially useful when a team member is struggling to implement something and others chime in to help debug, refactor, and review[4]. Everyone sees everyone's cursor for better awareness of what's being done. A use case is an ML practitioner working through a paper who is struggling with its algorithms or data structures. They can solicit another team member, who will chime in and help. One of the useful features is multiple checkpoints[5], which lets one revert to an arbitrary checkpoint, not just the last one. This, again, is porcelain, because git is a context switch for an ML practitioner, and they don't really like it.

So, we've done a few things to make our life easier. The workflow isn't perfect, and the tools aren't perfect, but we're removing friction.

We add applications for the usual time series forecasting, sentiment analysis, anomaly detection, and churn, to leverage projects we did in the past.

[0]: https://iko.ai

[1]: https://iko.ai/docs/notebook/#long-running-notebooks

[2]: https://iko.ai/docs/appbook/

[3]: https://iko.ai/docs/appbook/#deploying-a-model

[4]: https://iko.ai/docs/notebook/#collaboration

[5]: https://iko.ai/docs/notebook/#multiple-checkpoints


👤 akx
I'm the CTO at Valohai (we almost got into YC some years back!) - we solve many of these issues to let data scientists focus on the interesting bits. See https://valohai.com :)