Machine Learning in Research versus Production

Author

Pankaj Pansari

Published

November 1, 2023

I have been going through the book ‘Designing Machine Learning Systems’ by Chip Huyen to better understand how machine learning systems are deployed in production in industry. Coming from a research background, it’s been a good experience to learn the difference in emphasis and priorities between how machine learning is practised in research versus in production. Here, I summarize some of my learnings on this topic:

1. Stakeholders and Requirements

Research: The number of stakeholders on a research project is small and they are usually aligned on the project objective. The objective is often to achieve state-of-the-art performance on benchmark datasets. Model complexity is usually not an issue if it helps to eke out a tiny percentage of improvement on performance metrics.

Production: There are many stakeholders, and their requirements can differ and even conflict. It's usually not possible to design an ML system that satisfies all requirements, so a compromise must be reached. The value of a performance gain is case-dependent. At times, a tiny improvement can save a lot of money for a business. Often, though, a small improvement does not justify the increase in model complexity.

2. Computational Priorities

Figure: Model development code is a small part of the full codebase needed for deployment. Adapted from Sculley et al. 1

Research: A research group focuses most of its effort on model development. Because many different models have to be trained and compared, research prioritizes fast training. In other words, high throughput is the desired quality of the ML system, where throughput is the number of samples (or queries) the system can process per unit time.

Production: When deploying a model in production, the surrounding infrastructure of data pipelines, resource management, and servers has to be constructed, and this takes a lot of time and effort. Once deployed, the focus shifts to fast inference. Hence, we seek low latency, where latency is the time from when a prediction request arrives until its response is returned.
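
To make the throughput/latency distinction concrete, here is a minimal benchmarking sketch, assuming a hypothetical `model.predict` that accepts a batch of inputs. Larger batches generally raise throughput but make each individual request wait longer:

```python
import time
import numpy as np

def benchmark(model, requests, batch_size):
    """Toy measurement of throughput and average latency for batched inference."""
    start = time.perf_counter()
    latencies = []
    for i in range(0, len(requests), batch_size):
        batch = requests[i:i + batch_size]
        t0 = time.perf_counter()
        model.predict(np.stack(batch))                 # one forward pass for the whole batch
        latencies.extend([time.perf_counter() - t0] * len(batch))  # each request waits for its batch
    total = time.perf_counter() - start
    throughput = len(requests) / total                 # requests processed per second
    avg_latency = sum(latencies) / len(latencies)      # average wait per request (seconds)
    return throughput, avg_latency
```

In research-style offline evaluation the first number is what matters; in a user-facing service it is the second.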

3. Data

Research: The benchmark datasets in research are usually clean, static, and well-formatted. Anomalies in the datasets, if present, are usually known, and open-source code to process the datasets is often available.

Production: By contrast, data in real-world systems can be noisy, unstructured, and its distribution can shift with time. There may be bias in the data. Labels may be sparse, imbalanced, or incorrect. The data may not be available as a complete dataset but may arrive over time in the form of a stream as it is generated by users. Finally, one has to respect user privacy when dealing with their data.
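
As a rough illustration of the kind of sanity checks production data needs, here is a minimal sketch; the column names and tolerances are made up for the example:

```python
import pandas as pd

EXPECTED_COLUMNS = {"user_id", "timestamp", "amount"}   # hypothetical schema

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of problems found in an incoming batch of data."""
    problems = []
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    null_rate = df.isna().mean().max()                  # worst per-column fraction of nulls
    if null_rate > 0.05:                                # arbitrary tolerance
        problems.append(f"high null rate: {null_rate:.1%}")
    if (df.get("amount", pd.Series(dtype=float)) < 0).any():
        problems.append("negative values in 'amount'")
    return problems
```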

4. Fairness

Research: Fairness is rarely given consideration in a research setting where the agenda is achieving state-of-the-art accuracy. Even when fairness is considered, it is an afterthought. This also has to do with the lack of meaningful fairness metrics.

Production: A machine learning system deployed in the real world can affect members of society through the decisions it outputs. Giving strong emphasis to fairness helps ensure that no section of society is adversely affected by ML models and that all sources of bias have been dealt with. Unfortunately, much progress remains to be made on this front, as many models deployed for loan applications, predictive policing, and employee recruitment still discriminate against minority groups.
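
One simple pre-deployment check is demographic parity: compare the rate of positive decisions across groups. A minimal sketch, with toy data standing in for real loan-approval outputs:

```python
import numpy as np

def demographic_parity_gap(decisions, groups):
    """Gap between the highest and lowest positive-decision rate across groups."""
    decisions = np.asarray(decisions)
    groups = np.asarray(groups)
    rates = {g: decisions[groups == g].mean() for g in np.unique(groups)}
    return max(rates.values()) - min(rates.values()), rates

# Toy example: 1 = loan approved, for two demographic groups "a" and "b"
gap, rates = demographic_parity_gap([1, 0, 1, 1, 0, 0], ["a", "a", "a", "b", "b", "b"])
print(rates, gap)   # group "a" is approved twice as often as group "b"
```

A large gap does not by itself prove discrimination, but it is a signal that the model and its training data need a closer look.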

5. Interpretability

Interpretability is the ability of an ML system to explain the rationale behind the decisions it makes or the outputs it gives. If a system is interpretable, one can peek inside the model for better understanding; if there is a mistake or anomaly, one can pinpoint with confidence where the model is going wrong and debug accordingly.

Research: With ML research evaluated on a single objective, that of model performance, there is no incentive to ensure that the model is interpretable.

Production: Interpretability is a key requirement for real-world ML systems. For us to deploy an ML system in society, we need to be able to trust it, and this trust depends heavily on the system being interpretable. In case of misbehavior, we want to be able to diagnose and rectify the model.
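
As one concrete way to peek inside a model, here is a sketch of permutation importance, a model-agnostic technique: shuffle one feature at a time and measure how much the score drops. The `model.predict` and `score_fn` interfaces are assumptions for the example:

```python
import numpy as np

def permutation_importance(model, X, y, score_fn, n_repeats=5, seed=0):
    """Average drop in score per shuffled feature; a bigger drop means a more important feature."""
    rng = np.random.default_rng(seed)
    baseline = score_fn(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, j])            # break the link between feature j and the target
            drops.append(baseline - score_fn(y, model.predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances
```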

6. Monitoring and Maintenance

Research: In research, one often focuses solely on model development. The benchmark dataset is static, so there is no monitoring to be done once good results on it have been achieved. Advances in software and hardware, along with accumulated common wisdom, have made model development relatively fast and cheap.

Production: Deploying an ML system is likewise relatively fast. Once deployed, however, the system needs to be constantly monitored and maintained, and this is the hard part. Some aspects of ML systems make them much more challenging to monitor and maintain than traditional software, particularly the dependence of model performance on external data. At a high level, one needs to constantly ensure that the data pipeline does not break, the incoming data is sensible, and the distribution of incoming data does not differ considerably from the training distribution. In case of distribution drift, the model needs to be retrained.
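
As a rough sketch of one such check, a two-sample Kolmogorov-Smirnov test per feature can flag drift between the training data and a recent window of production data; the significance level here is arbitrary:

```python
from scipy.stats import ks_2samp

def drifted_features(train_X, recent_X, alpha=0.01):
    """Indices of features whose recent distribution differs significantly from training."""
    drifted = []
    for j in range(train_X.shape[1]):
        _, p_value = ks_2samp(train_X[:, j], recent_X[:, j])
        if p_value < alpha:                      # reject "same distribution" at level alpha
            drifted.append(j)
    return drifted
```

If this list is non-empty over several consecutive windows, that is a signal to investigate the pipeline and possibly retrain the model.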

Footnotes

  1. Sculley, David, et al. “Hidden technical debt in machine learning systems.” Advances in neural information processing systems 28 (2015).↩︎