Data engineering
CD4ML with Jenkins in DAGsHub
agent { dockerfile { additionalBuildArgs '--tag rppp:$BRANCH_NAME' args '-v $WORKSPACE:/project -w /project -v /extras:/extras -e PYTHONPATH=/project' } }Details:
# Dockerfile FROM python:3.8 # Base image for our job RUN pip install --upgrade pip && pip install -U setuptools==49.6.0 # Upgrade pip and setuptools RUN apt-get update && apt-get install unzip groff -y # Install few system dependencies COPY requirements.txt ./ # Copy requirements.txt file into image RUN pip install -r requirements.txt # Installing project dependenciesWhen building docker images from a Dockerfile, we can control which files docker needs to consider to create docker context, by defining ignore patterns in a .dockerignore file. This enables faster and lighter Dockerfile builds. In our case, the only external file needed to build the Docker image is the requirements.txt file:
# .dockerignore * # Ignores everything !requirements.txt # except for requirements.txt file ;)Now that we’ve defined the docker image we want to use to run our pipeline, Let’s dive into our Jenkins pipeline stages.
stage('Run Unit Test') { steps { sh 'pytest -vvrxXs' } }
stage('Run Linting') { steps { sh ''' flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics flake8 . --count --max-complexity=10 --max-line-length=127 --statistics black . --check --diff ''' } }
stage('Setup DVC Creds') { steps { withCredentials( [ usernamePassword( credentialsId: 'DVC_Remote_Creds', passwordVariable: 'PASSWORD', usernameVariable: 'USER_NAME'), ] ) { sh ''' dvc remote modify origin --local auth basic dvc remote modify origin --local user $USER_NAME dvc remote modify origin --local password $PASSWORD dvc status -r origin ''' } } }Explanation:
stage('Sync DVC Remotes') { steps { sh ''' dvc status dvc status -r jenkins_local dvc status -r origin dvc pull -r jenkins_local || echo 'Some files are missing in local cache!' # 1 dvc pull -r origin # 2 dvc push -r jenkins_local # 3 ''' } }
Options | Pros | Cons |
For All Commits | We will never miss any experiment | This will increase build latency. It will be extremely expensive if we use cloud resources for training jobs. Might be overkill to run the DVC pipeline for all commits/changes |
Only for changes in the master branch | Only master branch experiments are saved, which ensures only "approved" changes and experiments are tracked. | We can not compare experiments in the feature branch before merging it to master.“Bad” experiments can slip through the PR review process and get merged to master before we could catch it. |
Setup a manual trigger | We can decide when we want to run/skip an experiment. | Automation is not complete. There is still room for manual errors. |
“Special” Commit message syntax | We can decide when we want to run/skip an experiment. | Automation is not complete. There is still room for manual errors. Commits are immutable, and it would be awkward to amend or create a new commit just to add the instruction. It also mixes MLOps instructions with the real purpose of the commit messages - documenting the history of the code |
On Pull Request | We can run and compare experiments before we approve the PR. No “Bad” experiments can now slip through the PR review process. | None |
stage('Update DVC Pipeline') {
when { changeRequest() } //# 1
steps {
sh '''
dvc repro --dry -mP
dvc repro -mP # 2
git branch -a
cat dvc.lock
dvc push -r jenkins_local # 3
dvc push -r origin # 3
'''
sh 'dvc metrics diff --show-md --precision 2 $CHANGE_TARGET' //# 4
}
}
stage('Commit back results') { when { changeRequest() } steps { withCredentials( [ usernamePassword( credentialsId: 'GIT_PAT', passwordVariable: 'GIT_PAT', usernameVariable: 'GIT_USER_NAME'), ] ) { sh ''' git branch -a git status if ! git diff --exit-code dvc.lock; then # 1 git add . git status git config --local user.email $JENKINS_EMAIL # 2 git config --local user.name $JENKINS_USER_NAME # 2 git commit -m "$GIT_COMMIT_REV: Update dvc.lock and metrics" # 3 git push https://$GIT_USER_NAME:$GIT_PAT@dagshub.com/puneethp/RPPP HEAD:$CHANGE_BRANCH # 4 else echo 'Nothing to Commit!' # 5 fi ''' } } }
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.