CD4ML with Jenkins in DAGsHub
Published: June 08, 2021
This is the first part of a two-part blog series discussing how you can achieve continuous delivery for machine learning (CD4ML) using Jenkins and DVC pipelines.
In this blog, we will discuss how to set up this automation process for your project, and a few use cases that come as a byproduct of this automation.
For an in-depth explanation of the stages of this automation process, see part 2 of this blog post.
TL;DR
- Data science projects are unique, but they can still adopt many lessons from software delivery principles and methodologies.
- MLOps comes under the umbrella of DevOps. It addresses the practices needed to take models from data scientists' laptops to production.
- Versioning your data and models is the first step toward reproducible results, and DVC does a great job of it.
- Building on that, Data Science Pull Requests from DAGsHub and CML (Continuous Machine Learning) from the DVC team address some of the nuances of data science projects and apply standard CI/CD practices to ML projects.
- While continuous delivery for machine learning (CD4ML) gives us standard practices and principles around delivery in machine learning, CML is one possible implementation, relying on GitHub Actions.
- In this blog, we will extend the ideas of CML and implement them using Jenkins pipelines and DVC pipelines, with the help of DAGsHub's Jenkins plugin.
- The core of this post is how we automate running DVC pipelines as part of the Jenkins CI/CD pipeline. If there's one thing you take from this blog, it should be this Jenkinsfile.
On the way, I’ll show you a simple and elegant way of managing your experiments, based on Git branches and commits, which will make your life much better, especially when working in a team on complex projects.
Need for automation
Automation in a delivery project helps us introduce standards of working and seamless collaboration practices; it ensures product quality, eliminates tedious, repetitive manual effort, and, above all, reduces our time to production. Software delivery projects have evolved over the span of two to three decades, and good practices and principles have emerged with time. Meanwhile, data science and machine learning projects are still in an evolving space, where people are trying out different methods and sharing what works for them. Data science is already a vast space, with new research published on a day-to-day basis, and DS projects are typically dynamic in nature, so the ideal workflow and automation tools haven't been consolidated yet.

This post will show how we can apply automation that speeds up your research, ensures reproducibility, and improves the handling of results.
Treating experiments like potential new features in a software project opens up many possibilities for improving our engineering practices.
We will discuss two major use cases, along with other best practices that can be applied in DS projects. The first is how you can achieve remote training of your models, i.e., schedule an automated process to run your experiment/training job on a remote server. The second is how you can add transparency to the pull request review process, so that you can compare experiments to better judge the code changes. Both of these use cases help you standardize your experiment/research execution and evaluation, plus achieve seamless collaboration in the team.
Prerequisites
There is a lot of knowledge and information about Jenkins and how to use it around the web. For this post, I'll assume you already know how to configure pipeline jobs in the Jenkins UI, save credentials as secrets, and use them in pipeline steps.

Before you move on, these are a few things you should already have set up:
- A running Jenkins server that executes your CI pipeline. You can follow the instructions in JenkinsDockerSetup, or just set it up normally (see the sketch after this list).
- A few essential Jenkins plugins: the Docker Pipeline plugin, to run our jobs as containers, and a Branch Source plugin (GitHub, DAGsHub), to discover branches and pull requests from the repository.
- An end-to-end machine learning DVC pipeline that you want to run with the Jenkins pipeline. If you don't have one, you can use the example project I created for this post.
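For quick reference, a common way to spin up a Jenkins server locally with Docker looks roughly like this (a minimal sketch using the official jenkins/jenkins:lts image; the ports and volume name are typical defaults, not requirements):

# Start Jenkins in the background; the web UI becomes available at http://localhost:8080.
docker run -d --name jenkins \
  -p 8080:8080 -p 50000:50000 \
  -v jenkins_home:/var/jenkins_home \
  jenkins/jenkins:lts
# The initial admin password is printed in the container logs: docker logs jenkins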
A note about pipelines
I know we are talking about a Jenkins pipeline, a DVC pipeline, and automating one pipeline to run another pipeline. There you go, I said it: the tongue twister of the post.

Addressing the elephant in the room: the Jenkins pipeline is a continuous integration/delivery pipeline. We use it to automate our build process, which ensures safe and seamless integration of changes from the team into the main branch, and also automates the delivery. We define our build steps and checks in a file called Jenkinsfile in the root directory of our project, which the Jenkins server understands and executes for us.
In machine learning projects, the DVC pipeline defines the steps to run our experiments. Usually, it involves stages like data ingestion, processing, modelling, and evaluation/prediction. DVC lets us define our ML pipeline in the dvc.yaml file and helps us run ML experiments. DVC is a natural choice for the ML pipeline, as it versions the experiment results and optimizes which stages to run by checking stage dependencies and deciding what should and shouldn't run.
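To make that concrete, here is a minimal sketch of what such a dvc.yaml might look like; the stage names, scripts, and file paths are hypothetical, not taken from the example project:

stages:
  process:
    cmd: python src/process.py          # clean and split the raw data
    deps:
      - data/raw.csv
      - src/process.py
    outs:
      - data/processed.csv
  train:
    cmd: python src/train.py            # fit the model on the processed data
    deps:
      - data/processed.csv
      - src/train.py
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false

With this in place, dvc repro runs only the stages whose dependencies have changed, and the generated dvc.lock file records the exact versions of data, code, and outputs for reproducibility.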
This blog post will guide you through automating your experiment runs, i.e., running DVC pipelines inside your CI/CD (Jenkins) pipeline.
Jenkins pipeline for CI
We want to create a multibranch pipeline job so that our build runs for all branches, i.e., the experiments of all branches are run and compared when a pull request is raised. To create a multibranch pipeline job, you can follow along with the creating a Jenkins pipeline documentation.

After creating the job, open the 'configure' tab to verify your project repository URL and the path of the Jenkinsfile inside your repository. This is how Jenkins will know, for each branch, what to execute as part of the build process.
Figure 1: Configure the repository URL with behaviors to discover branches and pull requests
Figure 2: Configure the path to the Jenkinsfile inside your repository
Stages
Here are the stages we will define in our Jenkins pipeline (a condensed Jenkinsfile sketch follows the list):

Figure 3: End-to-end Jenkins pipeline stages
- Run unit tests
- Run linting tests
- DVC-specific stages
  - Set up the DVC remote connection
  - Sync DVC remotes
- On pull request:
  - Execute the end-to-end DVC experiment/pipeline
  - Compare the results
  - Commit the results back to the experiment/feature branch
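As a rough illustration, a declarative Jenkinsfile covering these stages could look like the sketch below. This is a condensed, hypothetical version, not the actual Jenkinsfile from the example project; the Docker image, test commands, credentials ID, remote name, and file paths are all assumptions:

pipeline {
    // Run every stage inside a container so the environment is reproducible.
    // Assumes the image has the project dependencies (pytest, flake8, dvc) installed.
    agent { docker { image 'python:3.9' } }
    stages {
        stage('Run unit tests') {
            steps { sh 'pytest tests/' }
        }
        stage('Run linting tests') {
            steps { sh 'flake8 src/' }
        }
        stage('Setup DVC remote connection') {
            steps {
                // Credentials are stored as Jenkins secrets, not in the repo.
                withCredentials([usernamePassword(credentialsId: 'dvc-remote',
                        usernameVariable: 'DVC_USER', passwordVariable: 'DVC_PASS')]) {
                    sh 'dvc remote modify origin --local auth basic'
                    sh 'dvc remote modify origin --local user $DVC_USER'
                    sh 'dvc remote modify origin --local password $DVC_PASS'
                }
            }
        }
        stage('Update DVC pipeline') {
            // Only run the full experiment when the build is for a pull request.
            when { changeRequest() }
            steps {
                sh 'git config user.name "jenkins" && git config user.email "jenkins@example.com"'
                sh 'dvc repro'                          // execute the end-to-end experiment
                sh 'dvc metrics diff --show-md master'  // compare results against master
                sh 'git add dvc.lock metrics.json && git commit -m "Update experiment results"'
                sh 'dvc push -r origin'                 // push data/models to DVC storage
                sh 'git push origin HEAD:$CHANGE_BRANCH' // commit results back to the branch
            }
        }
    }
}

The when { changeRequest() } directive and the CHANGE_BRANCH environment variable are provided by multibranch pipeline jobs, which is one reason we created the job as multibranch above.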
Use cases
Once you have the above setup running for your project, let's discuss a few handy use cases enabled by this automation.

Using Jenkins for remote training
Quite often in DS projects, training models is a time- and resource-consuming process. Ideally, we would like to make a few changes and schedule these training jobs to run offline, without obstructing our work. This is achievable through remote training jobs, where you schedule a training job to run on a remote, high-performance machine, and the job replies with the results for review after completion.

Figure 4: Remote training with DAGsHub and Jenkins
Reasons you would want to do remote training of your models:

- Everyone loves automation
- Your model training can be a very time-consuming process; you want to schedule a training job and then get notified with the results when it finishes
- The GPUs and compute needed for training are not present in your local development environment
- To eliminate costly data transfers between storage and the job environment, you want to run the training job as close to the data source as possible
- Due to working from home and low network bandwidth, you want to work on cloud machines to reduce your network load and latency
- Standardization of the environment and process makes sure everyone is measuring the same thing on the same machine
Figure 5: Jenkins commits the experiment results and updates the pull request
Reviewing experiment results
If we use Jenkins for remote training, it's important to consider how we evaluate the results of a training session.

Jenkins will commit the results back (metrics to Git, data/models to DVC). You can review them as follows:
Using DAGsHub
If you use DAGsHub to manage your repo, you can use its Data Science Pull Requests features to automatically see all the new data, models, and experiment results that were generated and committed by Jenkins, directly in your browser, without needing to clone the project to your machine.

Figure 6: Using DAGsHub to see the new experiments and models generated by Jenkins
Using the command line
Since these are all open source formats, and everything is stored in the Git history, you can also just clone and pull everything locally to look at it:
# 1: Fetch the Jenkins commit, i.e. the metadata (metrics and the dvc.lock file).
git pull origin {feature/experiment branch}
# 2: Fetch the data/models from DVC storage.
dvc pull -r origin
Now you have the latest metrics, data, and models on your machine, courtesy of Jenkins. You can review them and/or develop further.
Ignoring experiments
Sometimes we want to discard the current execution of the experiment. This can happen for a few reasons: our experiment branch is not up to date with the master branch, i.e., we have to rebase our experiment branch onto the latest master before it can be merged; or there was a code/data bug that we only noticed after Jenkins executed the experiment. In either case, we want to discard the existing experiment run in our branch, then re-run the experiment once we have the latest (from master) and greatest (bug-fixed) code.

This is analogous to rebuilding the artifact in a normal software delivery project with the latest, bug-free code. All we need to do in such a case is drop the experiment-save commit from Jenkins and force-push our latest changes.
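For example, assuming the Jenkins experiment-save commit is the most recent commit on your branch (a sketch; adjust the commit count and branch names to your situation):

# Drop the Jenkins experiment-save commit from the local branch.
git reset --hard HEAD~1
# Bring in the latest changes from master.
git fetch origin
git rebase origin/master
# Then force-push the cleaned-up branch: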
git push origin {feature/experiment branch} --force
Caution: force push and rebase rewrite your Git history. These are powerful, double-edged swords :) and should be used only when you know what you are doing.
Enhancing the pull request review process
Data science projects are more dynamic than software delivery projects. The main reason is that in a DS project you have all the factors that influence a software development project, plus a set of factors that are unique to the data world. To name a few:

- Quality and volume of data
- Data cleaning/processing steps
- Model complexity and explainability
- Technical and business metrics
In my opinion, there are two types of pull request review processes.
Black Box Review:
Reviewing only the code changes and making sure all tests pass. In machine learning, this is not enough, because the changes can have more subtle implications than simply "good" or "bad".
Transparent Review:
We should be able to assess the effect and implications of our changes by comparing experiments. Hence, the pull request review process should include experiment comparison, to increase transparency.
Experiment comparison can involve comparing changes in metrics, hyperparameters, data distribution, data cleaning, and/or model algorithms/architecture. To begin with, we can compare metrics and take it forward from there.
We can compare the metrics tracked at the end of the experiment with the dvc metrics diff command. This is done as part of the Update DVC Pipeline stage defined in our pipeline. DAGsHub will automatically detect all the experiments that were run as part of the pull request, making it easy to compare them with the experiments in the main branch. To learn more, check out Data Science Pull Requests from DAGsHub.
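For reference, the same comparison can be run locally from a checked-out experiment branch:

# Compare metrics in the current workspace against the master branch.
dvc metrics diff master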
Figure 7: Comparing metrics between feature branch and master
Takeaways
- Jenkins is the most widely used open-source automation tool
- DVC is great for versioning data/models, defining experiment pipelines, tracking experiments, etc.
- As shown here, we can also run ML experiments as part of our CI/CD pipelines. Doing so standardizes the experiment execution environment and process, ensuring everyone measures the same thing on the same machine in the same way
- Integrating this into your pull requests allows you to achieve more transparency in your review process, especially when used with Data Science Pull Requests.
- Automation can seem like overhead, but it provides productivity multipliers which add up to huge benefits, especially as we gradually improve automation and don't regress
This article first appeared on DAGsHub.com.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.