In recent years, applying AI models in products to solve problems has become common practice. We’ve seen many successful applications, such as TikTok’s recommendation system and deep-learning-based autonomous driving. Some apps also offer easy-to-use AI features, like Word’s OCR (Optical Character Recognition). But a deeper understanding of machine learning algorithms reveals that applying them is no simple matter.
One difficulty comes from a general characteristic of machine learning models: a model needs to learn how to solve a problem from a large amount of historical data. This process is generally called "training," and the result of training is never 100% accurate. For deep learning models in particular, the trained model is not interpretable: we cannot explain how it understands the input data and arrives at its outputs.
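To make the idea of "training" concrete, here is a minimal sketch in pure Python (the synthetic data and the threshold-learning approach are invented for illustration, not a real product pipeline): a simple classifier learns a decision threshold from labeled historical data, and even the best threshold stays below 100% accuracy on held-out data because the two classes overlap.

```python
import random

random.seed(0)

# Synthetic "historical data": a 1-D feature where class 1 tends to have
# larger values, but the classes overlap, so no model can be perfect.
def make_data(n):
    rows = []
    for _ in range(n):
        label = random.randint(0, 1)
        x = random.gauss(mu=label * 2.0, sigma=1.0)
        rows.append((x, label))
    return rows

train, test = make_data(1000), make_data(500)

def accuracy(threshold, rows):
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)

# "Training": pick the threshold that best separates classes on train data.
candidates = [x for x, _ in train]
best = max(candidates, key=lambda t: accuracy(t, train))

print(f"train accuracy: {accuracy(best, train):.2f}")
print(f"test accuracy:  {accuracy(best, test):.2f}")  # imperfect by nature
```

Even with the optimal threshold, the overlapping distributions put a hard ceiling on accuracy, which is exactly why a trained model's results can never be fully guaranteed.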
Therefore, we cannot predict a model's performance before actually letting it learn from historical data; even on the same data, different model families, such as linear models and tree models, can produce drastically different results. When applying machine learning, we first need to spend considerable effort processing the original data, training a variety of candidate models, and selecting the most suitable one based on metrics computed from the results. A data scientist is generally needed for this, and because of the high uncertainty of this work, various possibilities need to be explored. I call this stage the "exploration stage."
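As an illustration of how model families can diverge on the same data, here is a sketch with synthetic data (the dataset and both toy "models" are invented for this example): a single linear threshold cannot separate a class that lives inside an interval, while a two-split rule, akin to a shallow decision tree, can.

```python
import random

random.seed(1)

# Synthetic data: class 1 when the feature falls inside an interval,
# so one linear threshold cannot separate it, but two splits can.
data = []
for _ in range(2000):
    x = random.uniform(-3, 3)
    data.append((x, int(abs(x) < 1.0)))
train, test = data[:1500], data[1500:]

def accuracy(predict, rows):
    return sum(predict(x) == y for x, y in rows) / len(rows)

splits = sorted(x for x, _ in train[:60])

# Candidate 1: "linear" model -- the single best threshold.
best_t = max(splits, key=lambda t: accuracy(lambda x: int(x > t), train))
def linear(x):
    return int(x > best_t)

# Candidate 2: "tree" model -- the best pair of splits (an interval rule).
pairs = [(lo, hi) for lo in splits for hi in splits if lo < hi]
best_lo, best_hi = max(
    pairs, key=lambda p: accuracy(lambda x: int(p[0] < x < p[1]), train)
)
def tree(x):
    return int(best_lo < x < best_hi)

print(f"linear model test accuracy: {accuracy(linear, test):.2f}")
print(f"tree model test accuracy:   {accuracy(tree, test):.2f}")
```

The same data, the same metric, yet one model family is stuck near chance while the other performs well; this is why the exploration stage has to try several candidates rather than commit to one up front.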
The exploration stage is only half the work. It gives us data-processing procedures and suitable models that have been verified against existing data; the next step is to productionize the data preprocessing and the model. Completing this phase requires collaboration between data scientists and data engineers. I call this stage the "productization stage."
The productization of machine learning models doesn’t stop at deploying and using the model. The model learns from the historical data we prepared; however, the data generated in the production environment isn’t always consistent with that historical data. This phenomenon is called data drift, and it typically causes model performance to decay as new data is collected. Imagine a marathon runner who trains on a flat road every day, only to be faced with a rugged mountain road on race day. The runner would be highly unlikely to match their usual results.
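One common way to spot data drift (a sketch, not tied to any particular monitoring tool; the 0.1 alert threshold below is an illustrative choice) is to compare a feature's distribution in production against the training data, for example with the two-sample Kolmogorov-Smirnov statistic:

```python
import random

random.seed(2)

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            i += 1
        else:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

training = [random.gauss(0, 1) for _ in range(1000)]
stable = [random.gauss(0, 1) for _ in range(1000)]   # same distribution
drifted = [random.gauss(1, 1) for _ in range(1000)]  # mean has shifted

# Small statistic: distributions match; large statistic: drift alert.
print(f"no drift: {ks_statistic(training, stable):.3f}")
print(f"drifted:  {ks_statistic(training, drifted):.3f}")
```

A scheduled job could compute such a statistic per feature and alert when it crosses a chosen threshold, giving an early warning before the model's business metrics visibly degrade.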
To solve this problem, we need to accumulate data in the production environment and observe the model metrics we care about. If these metrics drop significantly, we need to retrain and update the model with the newly accumulated data. Of course, an alternative and easier approach is to retrain the model on a regular schedule, rather than relying on monitoring the data distribution in the production environment.
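The monitoring side of this can be sketched as a rolling-window check on a metric. The class name, window size, and 0.8 accuracy threshold below are illustrative assumptions, not a standard API:

```python
from collections import deque

class ModelMonitor:
    """Track rolling accuracy on production data and flag
    when the model should be retrained."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)  # recent hit/miss outcomes
        self.threshold = threshold

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    def needs_retraining(self):
        # Wait until the window is full so a few early misses don't alert.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return sum(self.outcomes) / len(self.outcomes) < self.threshold

monitor = ModelMonitor(window=100, threshold=0.8)
triggered_at = None
# Simulate decay: predictions are mostly right at first, then mostly wrong.
for i in range(300):
    correct = i < 150 or i % 3 == 0
    monitor.record(1 if correct else 0, 1)
    if monitor.needs_retraining():
        triggered_at = i + 1
        break

print(f"retraining triggered after {triggered_at} predictions")
```

The simpler alternative the text mentions, retraining on a fixed schedule, would replace this monitor with a plain cron-style job, trading some wasted retraining runs for less monitoring infrastructure.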
Both the exploration phase and the productization phase require constant iteration. Different phases also have different challenges. We’ll explore these challenges in more detail in the next part of this article.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.