Machine Learning Models and Datasets Versioning Practices and Tools
Talk Title | Machine Learning Models and Datasets Versioning Practices and Tools |
Speakers | Dmitry Petrov (Co-Founder & CEO, DVC), Ruslan Kuprieiev (Software Engineer, Iterative AI) |
Conference | Open Source Summit + ELC Europe |
Conf Tag | |
Location | Lyon, France |
Date | Oct 27-Nov 1, 2019 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
The rise of AI and ML changes the development workflow and requires new development tools: data versioning, ML pipeline versioning, experiment metrics tracking, and others that have not been formalized or even named yet.
The machine learning workflow is data-centric, in contrast to the source-code-centric workflow of software engineering, and the traditional software engineering toolset does not fully cover an ML team's needs. We will discuss current practices for organizing ML projects with traditional open-source tools such as Git and Git-LFS, as well as their limitations, and use those limitations to explain the motivation for developing new ML-specific data management systems.
Data Version Control, or DVC.ORG, is an open-source command-line tool. We will show how to version datasets with dozens of gigabytes of data, how to version ML models, how to use your favorite storage (S3, GCS, or a bare-metal SSH server) as a data file backend, and how to embrace the best engineering practices in your ML projects.
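To make the idea of versioning large data files alongside Git more concrete, here is a minimal Python sketch of one common approach: keep the bytes in a content-addressed cache and commit only a small pointer file. This is an illustration under stated assumptions, not DVC's actual implementation; the function names (`file_md5`, `track`, `restore`) and the `.ptr.json` pointer format are hypothetical and chosen only for this example.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large dataset need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into the cache and write a small pointer file next to it.

    The pointer file (hypothetical format) is what would be committed to Git;
    the cache, or a remote copy of it, holds the actual bytes.
    """
    checksum = file_md5(data_file)
    cached = cache_dir / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".ptr.json")
    pointer.write_text(json.dumps({"path": data_file.name, "md5": checksum}, indent=2))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file that a pointer file refers to."""
    meta = json.loads(pointer.read_text())
    checksum = meta["md5"]
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_dir / checksum[:2] / checksum[2:], target)
    return target
```

Checking out an older commit would bring back an older pointer file, and `restore` would then bring back the matching version of the data. The pointer stays a few bytes long however large the dataset is, which is the reason for keeping it, rather than the data, in Git.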