Training computer vision models are notoriously computationally intensive, often requiring multiple GPUs. It’s therefore usually not performed locally. One of the challenges when it comes to launching training jobs on the cloud or private GPU clusters is dealing with all the required manual steps. For example, when using AWS, ML engineers need to spin-up an EC2 instance manually to launch a training job, and then manually decommission it once the training job is completed.
Although commercial tools do exist to automate this process (for example, SageMaker or DataBricks), at Sama we decided to build our own automated training pipeline in order to limit costs, and to avoid tying ourselves to a particular cloud provider.
Our pipeline allows our ML Engineers and Scientists to launch a training job on the cloud by simply pushing the code they developed locally to a predefined git branch. This is very simple to achieve using a modern CI/CD platform like Codefresh. A Codefresh “trigger” can be set to track commits to specific git branches of a repository. Once a commit is pushed to a target branch, a Codefresh pipeline is triggered. The pipeline is just a workflow defined in a yaml file that executes the following steps :
- Clone the tracked git repository, as well as any other repos required (in our case we have a separate repo for all our ML related tools).
- Build a Docker image with all the project specific dependencies.
- Deploy the training job on the GPU cluster.
- Send a slack message to a channel that tracks all the training jobs whenever the results are ready or if there is an error.
The specific implementation of step 3 above depends on the particular frameworks and libraries that are used to train the models and track their performance, as well as the cloud provider. In our team, we use Kubernetes to orchestrate the creation and decommission of the instances. We also use mlflow to manage the ML lifecycle, which has built-in Kubernetes deployment support. So for us, this step simply reduces to doing some cloud provider specific configuration and running an mlflow experiment with Kubernetes as the backend.
Summary of the Sama automated training pipeline: from pushing the code to Github to running the training job on AWS.
Aside from the obvious time savings, the advantage of this setup is that it is fully configurable and can be made to work with any cloud provider, or even a private GPU cluster. As an added bonus, it enforces experiments to be separated in different git commits, which we find is good practice.