Many data experts have heard the term DataOps. There is often confusion, however, about what this term means in practice. This is understandable, as DataOps encompasses many topics which may be unrelated to each other. Moreover, DataOps practices are often missing from traditional Data Warehouse development, so data developers might be unfamiliar with these methods. In this article, we address what DataOps means from Zure data developers’ perspective and highlight the tools that Microsoft Azure offers for DataOps practices.
DataOps is commonly defined as DevOps with certain data-related features and nuances. Instead of code and application deployments, DataOps usually deals with database deployments and data model publishing, such as deploying an SQL project to the Staging and Production environments automatically. The DataOps process might also define how database backups and rollbacks are made, so that the system can recover from failures with minimal damage. Testing conventions in DataOps also differ slightly from DevOps, with more emphasis on data integrity, data quality and data diversity.
Before diving into our methods, a reminder about why DataOps practices are needed. The benefits of DataOps include:
- Ease of development, maximized automation
- Risk management & disaster recovery
- Collaboration (asynchronous)
- Quality control (common practices, automated testing, monitoring)
- Change tracking & accountability
In the rest of this article, we discuss how DataOps practices at Zure ensure that these capabilities are met in daily operation. This article is intended for data experts with beginner or intermediate level knowledge on DataOps.
Version control systems, such as Git, provide a common repository for code, scripts and other assets used by data solutions. Ideally, the version control system should contain all the resources needed to publish the solution to the cloud. This includes the cloud infrastructure, which will be discussed below.
Continuous integration (CI) in DevOps is the automated process of building or compiling the solution when the application code in the version control system is updated. Similarly, a database project might require such a build step to create a packaged version of the solution, for example a .dacpac file for SQL Server. Usually, the CI process is triggered when a commit or a Pull Request is made into a specific branch, rather than on every commit to the version control repository.
Continuous integration may also include other preparation tasks that are executed only once for each build. These tasks often include different types of testing, such as unit testing, ML model testing or virus scanning of external files.
Many organizations also have a peer review process for code changes. Before the code is merged into the CI branch, the developer must create a Pull Request (also known as a Merge Request) from their development branch. The development branch will not be merged into the CI branch until the required reviewers have accepted the Pull Request.
As an additional safeguard, we recommend requiring that all changes to the CI branch go through Pull Requests. This means that the version control system rejects all direct commits, merges and rebases to the CI branch, which further prevents accidental commits that might trigger the build process. Furthermore, a checklist of coding best practices is useful when reviewing Pull Requests. A common checklist provides a baseline for code changes, ensuring for example that the correct naming conventions and terminology are used across the application.
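Parts of such a checklist can be automated. The following is a minimal sketch, assuming a hypothetical convention in which table names are written in lower snake_case and start with a layer prefix (`stg_`, `dim_` or `fact_`). The pattern and the function name are illustrative only and are not taken from any particular project.

```python
import re

# Hypothetical convention: lower snake_case with a layer prefix.
NAME_PATTERN = re.compile(r"^(stg|dim|fact)_[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def find_naming_violations(table_names):
    """Return the names that do not follow the assumed convention."""
    return [name for name in table_names if not NAME_PATTERN.match(name)]


violations = find_naming_violations(
    ["dim_customer", "FactSales", "stg_orders_2022", "customer"]
)
print(violations)
```

A check like this could run as part of the build, so that reviewers can focus on the parts of the checklist that need human judgment.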
Note that the version control system should be used mainly for storing code and data models. Generally, operational data should not be committed to version control. Data in the version control system is harder to access from operational systems and should therefore be stored in a database or other storage service. Sometimes the version control system can be used to store other types of data, such as test data, data samples or configuration data.
We recommend Azure (DevOps) Repos as a version control system when developing solutions for Azure. Azure Repos is a Git-based version control system that integrates well with other Azure services. If you do not wish to create an Azure DevOps account, GitHub is another recommended version control system that integrates well with Azure.
Continuous deployment (CD) is the automated process of publishing code changes to production. The CD process is usually launched automatically after the CI step has finished successfully. Since the CD process is automated, it is very important that sufficient safeguards are in place to prevent accidental changes from being released to production. Usually this is ensured by extensive testing in the CI phase. The production release can also be configured to require manual approval by a developer or an admin.
In addition to the main production environment, many data solutions have multiple copies of the environment for testing and development purposes. Some solutions may also have multiple production environments, spread across geographical locations or different user groups. Manually publishing the code to each environment is time-consuming and prone to errors. Managing the different deployment environments is an essential part of DataOps.
In Azure, we recommend Azure (DevOps) Pipelines for CD operations. At Zure, Azure Repos and Azure Pipelines are the main tools for managing deployments in the Azure Cloud. Azure Pipelines keeps track of past deployments and links the deployments to specific branches and commits in the code repository. The integration between Azure Repos and Azure Pipelines allows developers to examine exactly which code has been released in a particular deployment. If a deployment fails, developers can easily re-run it with the same configuration as before, or re-run only the failed parts if they wish.
We recommend defining the build and release pipelines as code, as opposed to using graphical user interfaces. For example, moving the pipeline to a new tenant is much easier if the pipeline can be re-created from code. In the code-based approach, developers can also define pipeline templates that are used as a starting point when creating new pipelines. In Azure Pipelines, the pipelines can be written in the YAML markup language.
Infrastructure as Code
One benefit of cloud platforms is that the underlying technical infrastructure can be defined as code. Any Azure service that can be created from the Azure Portal UI can also be deployed programmatically using the command line, REST APIs or programming SDKs. Defining the infrastructure as code enables automated scenarios in which resources are created without developer intervention. Automated resource creation can improve cost-efficiency when resources are needed only for a limited time. For example, a weekly Databricks job on a large cluster needs the computation resources only for the duration of the run. It might make sense to remove the cluster entirely between runs, avoiding any overhead costs that would otherwise accumulate even while the cluster is shut down.
Another benefit of defining infrastructure as code is that it ensures the consistency of different deployment environments. Creating the resources manually might lead to human errors, as it is up to the developer to make sure that the same parameters are used in every deployment. It might also be difficult to discover afterwards which parameters were used for a specific deployment. If the infrastructure is defined as code, the code repository keeps track of changes to the resources and allows the team to revert to a previous configuration with minimal effort.
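The consistency argument can be illustrated with a minimal Python sketch that is independent of any Azure SDK or template language. It assumes that every environment is derived from one shared set of parameters and that deliberate differences are listed as explicit overrides. All parameter names and values below are hypothetical.

```python
import json

# Shared definition: every environment starts from the same parameters.
BASE_PARAMETERS = {
    "location": "westeurope",
    "sql_sku": "S0",
    "backup_retention_days": 7,
}

# Only deliberate differences are listed per environment (hypothetical values).
OVERRIDES = {
    "dev": {},
    "staging": {},
    "prod": {"sql_sku": "S3", "backup_retention_days": 35},
}


def parameters_for(environment):
    """Merge the base parameters with the overrides of one environment."""
    if environment not in OVERRIDES:
        raise KeyError(f"Unknown environment: {environment}")
    return {**BASE_PARAMETERS, **OVERRIDES[environment], "environment": environment}


def differences_from_base(environment):
    """List the keys whose values deviate from the shared base."""
    params = parameters_for(environment)
    return sorted(k for k, v in BASE_PARAMETERS.items() if params[k] != v)


for env in OVERRIDES:
    print(env, json.dumps(parameters_for(env), sort_keys=True))
```

Because the overrides live in the repository, a change to a production parameter appears as a reviewable diff, and `differences_from_base("prod")` shows at a glance where production deviates from the other environments.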
Data Product Release and Management
DataOps is mainly concerned with the management of databases and storage services, along with the processes that move and transform bulk loads of data in these services. The tools used for deployments depend on the service and it is usually up to the development team to decide which one to use. We won’t list recommendations for individual tools or services here, but a few things to keep in mind when choosing a tool are the DataOps team’s familiarity with it, how well it integrates with the rest of the CI/CD infrastructure (Azure DevOps) and whether it meets all the feature requirements, especially in the cloud environment.
If Azure SQL is used, we highly recommend that an SQL database project is created using Visual Studio SQL Server Data Tools (SSDT). The database project helps to manage the objects in the database and provides tools for deploying the project to the cloud. The whole database solution can be packaged as a .dacpac file and published as a single entity. The database project may also contain multiple release profiles that determine the configuration for each deployment environment. If developers are more experienced with C# than SQL, the Entity Framework Core approach may also be used.
It should also be emphasized that the deployment process for structured (relational) data, such as SQL databases, is different from that for unstructured data, such as a Data Lake. Modern approaches like the Data Lakehouse combine elements from both, possibly complicating things even further. If the DataOps team is unfamiliar with the deployment methods, it is important to recognize that implementing the deployment pipeline can be a time-consuming task that takes a significant portion of project time. Creating a common DataOps template for the organization might be a good idea, especially if there are multiple teams working with the same services.
Testing is often overlooked in data engineering. This is unfortunate, because rigorous testing can catch bugs that would otherwise be found only when the application is already running in production. Data solutions are particularly interdependent by nature, and changes to one part of the solution often have consequences elsewhere. Especially when multiple people work on the same project, it is very important that developers write tests for data when developing new features. The tests should not be written with minimal effort either: the team should spend some extra time to create meaningful tests. This improves the long-term stability of the solution and saves time on future bug hunting.
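As an illustration of what a meaningful data test might check, the following minimal sketch tests key uniqueness and referential integrity. It assumes an in-memory SQLite database from Python's standard library as a stand-in for the real database, and the tables, columns and helper names are hypothetical.

```python
import sqlite3


def build_sample_database():
    """Create an in-memory database with a small, hypothetical data model."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer (customer_id INTEGER, name TEXT);
        CREATE TABLE sales (sale_id INTEGER, customer_id INTEGER, amount REAL);
        INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
        INSERT INTO sales VALUES (10, 1, 25.0), (11, 2, 40.0);
        """
    )
    return conn


# Table and column names are interpolated directly; they come from trusted
# test code, never from user input.
def count_duplicate_keys(conn, table, key):
    """Number of key values that occur more than once."""
    query = (
        f"SELECT COUNT(*) FROM "
        f"(SELECT {key} FROM {table} GROUP BY {key} HAVING COUNT(*) > 1)"
    )
    return conn.execute(query).fetchone()[0]


def count_orphans(conn, child, parent, key):
    """Number of child rows whose key has no match in the parent table."""
    query = (
        f"SELECT COUNT(*) FROM {child} c "
        f"LEFT JOIN {parent} p ON c.{key} = p.{key} WHERE p.{key} IS NULL"
    )
    return conn.execute(query).fetchone()[0]


def test_customer_keys_are_unique():
    conn = build_sample_database()
    assert count_duplicate_keys(conn, "customer", "customer_id") == 0


def test_sales_reference_existing_customers():
    conn = build_sample_database()
    assert count_orphans(conn, "sales", "customer", "customer_id") == 0


def test_orphan_rows_are_detected():
    conn = build_sample_database()
    # A sale pointing to a customer that does not exist must be reported.
    conn.execute("INSERT INTO sales VALUES (12, 3, 15.0)")
    assert count_orphans(conn, "sales", "customer", "customer_id") == 1
```

The functions prefixed with `test_` follow the naming that pytest discovers by default, so the same checks could be pointed at a test copy of the real database instead of the sample data.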
The tests should be run every time the solution is deployed. It is highly recommended that testing is automated and the tests are run during the build & release pipeline. Deployment should be conditioned on the test outcome so that the release is interrupted if the tests are not finished successfully.
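The gating logic itself is simple. The following minimal sketch assumes that each pipeline step is a Python callable returning `True` on success; the step names are hypothetical. Pipeline tools provide this behaviour out of the box, so the sketch only illustrates the principle.

```python
def run_release(steps):
    """Run (name, callable) steps in order and stop at the first failure.

    Each callable returns True on success. Returns the names of the steps
    that were actually executed and whether the release succeeded.
    """
    executed = []
    for name, step in steps:
        executed.append(name)
        if not step():
            return executed, False
    return executed, True


# Hypothetical steps: the test step fails, so the deployment never starts.
steps = [
    ("build", lambda: True),
    ("run_tests", lambda: False),
    ("deploy_to_production", lambda: True),
]

executed, succeeded = run_release(steps)
print(executed, succeeded)
```

In this run the release is interrupted after `run_tests`, and `deploy_to_production` is never executed.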
Use Azure DevOps Pipelines to run the tests during each build. Most test frameworks can be run from the command line, so ideally you just need to add a single task to the build pipeline to automate testing. It is also recommended to add the Publish Test Results task so that the results can be viewed in the Azure Pipelines UI. Please note that the Publish Test Results task reads result files in a fixed set of formats, such as JUnit, NUnit and xUnit, so the test runner must be able to write one of them. For Python code, for example, pytest is a convenient choice because it can write JUnit XML through its --junitxml option, whereas the standard unittest runner has no built-in JUnit output.
If Azure SQL is used, it is recommended to use SQL Server Data Tools to define unit tests. SSDT has a specific project type for SQL unit tests which includes many useful features for defining and running the tests. If SSDT is not available, we strongly recommend that some other unit test framework is used. Test frameworks include many useful features for writing tests and reduce the time spent on test automation.
At the time of writing (Nov. 2022), Azure Data Factory does not have a framework for unit testing or integration testing. Hopefully this will change in the future, but in the meantime it is up to the developer to test the pipelines in a non-production instance before publishing them to production. The testing environment should be very similar to the production environment in order to keep the tests as realistic as possible. One approach to test automation in Data Factory is creating a fixed test dataset that should always produce the same intermediate and end results. Doing a “dry run” of the pipelines with test data during the deployment can be considered a unit test, but this approach requires additional validation steps to make sure that the results match the expected outcomes.
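The validation step could look roughly like the following sketch. It assumes that the expected results of the fixed test dataset and the rows produced by the dry run are both available as lists of dictionaries. How the rows are read back from the sink is not shown, and the column names are hypothetical.

```python
def compare_rows(expected, actual, key):
    """Compare two lists of row dicts by a key column.

    Returns the keys that are missing from the actual output, unexpected
    in it, or present in both with different values.
    """
    expected_by_key = {row[key]: row for row in expected}
    actual_by_key = {row[key]: row for row in actual}
    return {
        "missing": sorted(expected_by_key.keys() - actual_by_key.keys()),
        "unexpected": sorted(actual_by_key.keys() - expected_by_key.keys()),
        "changed": sorted(
            k
            for k in expected_by_key.keys() & actual_by_key.keys()
            if expected_by_key[k] != actual_by_key[k]
        ),
    }


# Hypothetical expected result for the fixed test dataset.
expected = [
    {"id": 1, "total": 100.0},
    {"id": 2, "total": 250.0},
    {"id": 3, "total": 75.0},
]
# Hypothetical rows read back from the sink after the dry run.
actual = [
    {"id": 1, "total": 100.0},
    {"id": 2, "total": 260.0},
    {"id": 4, "total": 10.0},
]

report = compare_rows(expected, actual, key="id")
print(report)
```

A deployment step could then fail whenever any of the three lists in the report is non-empty.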
Monitoring & alerting
Keeping the solution in operation with minimal failures is a significant part of DataOps. The support team should be able to detect errors without delay and identify the source of the error as accurately as they can. High-quality DataOps practices provide the tools for support engineers to keep track of key performance metrics in real time.
As a solution goes into production, the development team should decide which performance metrics are used to monitor it. Alerting thresholds for the metrics should also be decided, along with the channel that the alerts will be sent to. Azure Monitor is the main service for monitoring in Azure and is recommended because of its integration with other Azure services. For example, it is easy to set up a Microsoft Teams channel where all the alerts from a specific application are sent.
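The decisions described above (metric, threshold and channel) can be written down as data. The following minimal sketch evaluates such rules in plain Python. It does not use Azure Monitor, and the metric names, thresholds and channel names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str
    threshold: float
    channel: str  # where the alert is sent, e.g. a chat channel name


def evaluate(rules, measurements):
    """Return (channel, message) pairs for every metric above its threshold."""
    alerts = []
    for rule in rules:
        value = measurements.get(rule.metric)
        if value is not None and value > rule.threshold:
            alerts.append(
                (rule.channel, f"{rule.metric}={value} exceeds {rule.threshold}")
            )
    return alerts


# Hypothetical metrics and thresholds agreed by the development team.
rules = [
    AlertRule("pipeline_duration_minutes", 60, "data-platform-alerts"),
    AlertRule("failed_rows", 0, "data-platform-alerts"),
    AlertRule("dtu_percent", 90, "database-alerts"),
]
measurements = {"pipeline_duration_minutes": 42, "failed_rows": 17, "dtu_percent": 95}

for channel, message in evaluate(rules, measurements):
    print(channel, message)
```

Keeping the rules in version control alongside the solution makes changes to alerting thresholds reviewable in the same way as code changes.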
The relevance of logging should also be emphasized when deploying a solution to production. Logs are usually the only way to get information about errors in production. Extensive and detailed logging in the application can be very helpful when investigating problems and reduces the time spent on fixing errors. Use Azure Application Insights to collect application logs and Azure Log Analytics Workspace to analyze the logs from multiple applications.
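Detailed logs are easier to query when they are structured. The following minimal sketch uses only Python's standard `logging` module to write one JSON object per log record. The logger name and the context fields (`pipeline`, `run_id`, `rows`) are hypothetical, and forwarding the records to Application Insights would need a separate exporter that is not shown here.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra context (hypothetical field names) passed via `extra=`.
        for field in ("pipeline", "run_id", "rows"):
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("load_sales")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)
log.info(
    "Load finished",
    extra={"pipeline": "load_sales", "run_id": "r-001", "rows": 1200},
)
print(stream.getvalue())
```

With fields such as the run identifier attached to every record, a support engineer can filter all log lines that belong to one failed run.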
For a lower-level and more customizable monitoring solution, the Azure Event Hubs and Azure Event Grid services can be used. These services can pass arbitrary messages across the whole Azure platform, although they require some customization to send alerts and metrics and to consume them from a client application.
The ways of working in data development are not much different from general application development. Data solutions are usually developed with the familiar agile methods, although the features are often less oriented towards the end user and more towards the technical abilities of the system. Data Science projects might follow their own process, but they are outside the scope of this article.
At Zure, we use Azure DevOps in our daily operations to manage our work and to communicate our progress to the customer. The Sprints, Boards and Backlogs features in Azure DevOps offer the tools for whichever agile process you are following. The work management features also integrate with the Azure Repos version control system. For example, individual work items can be linked to specific repository branches, increasing the traceability of features on the code level.
In this article, we have highlighted the DataOps practices that are commonly used at Zure. DataOps is a vast topic and the details always depend on the actual data architecture of the solution, so of course we have only scratched the surface here. Our main takeaway is that implementing at least some of the DataOps practices mentioned in this article will likely benefit the long-term error resilience of the application and reduce the time spent on production deployments. Data engineers might have a slight learning curve when implementing DataOps practices, but it will pay off when the total lifetime of the application is considered.