I’ve seen a few people start Azure Data Factory (ADF) projects assuming that they need one source control repo per environment: one Git repo attached to Dev, another to Test, and another to Prod.
Microsoft recommends against this, saying:
“Only the development factory is associated with a git repository. The test and production factories shouldn’t have a git repository associated with them and should only be updated via an Azure DevOps pipeline or via a Resource Management template.” (Microsoft Learn content on continuous integration and delivery in Azure Data Factory)
You’ll find the recommendation of only one repo connected to the development data factory echoed by other ADF practitioners, including myself.
We use source control largely to control code changes and for disaster recovery. I think the desire to use multiple repos is about disaster recovery more than anything. If something bad happens, you want to be able to access and re-deploy your code as quickly as possible. And since we start building in our repo-connected dev environment, some people feel “unprotected” in higher environments.
But Why Not Have All the Repos?
For me, there are two main reasons to have only one repository per solution tied only to the development data factory.
First, having multiple repos adds more complexity with very few benefits.
Having a repo per environment adds extra work in deployment. I can see no additional benefits for deployment from having a repo per environment for most ADF solutions. I won’t say never, but I have yet to encounter a situation where this would help.
When you deploy your data factory, whether you use ARM templates or individual JSON files, you are publishing to the live ADF service. This is what happens when you publish from your collaboration branch in source control to the live version in your development data factory. If you follow normal deployment patterns, you deploy from the main (if you use JSON files) or adf_publish (if you use the ARM template) branch in source control to the live service in Test (or whatever your next environment is). If your Test data factory is connected to a repo, you need to figure out how to get your code into that repo.
Would you copy your code to the other repo before you deploy to the service? What if something fails in your deployment process before deployment to the live service is complete? Would you want to revert your changes in the Git repo?
You could deploy to the live service first and skip that issue. But you still need to decide how to merge your code into the Test repo. You’ll need to handle any merge conflicts. And you’ll likely need to allow unrelated histories for the merge to work, so when you look back in your commit history, it probably won’t make sense.
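To make that extra step concrete, here is a minimal Python sketch of the Git commands a Dev-to-Test repo sync might involve. The remote name `dev`, the example URL, and the function name are placeholders I made up for illustration. The function only builds the commands so you can review them; it doesn’t run anything.

```python
from typing import List


def sync_test_repo_commands(dev_repo_url: str, branch: str = "main") -> List[List[str]]:
    """Build the git commands to pull Dev's branch into a separate Test repo.

    Assumes the working directory is a clone of the (hypothetical) Test repo.
    Nothing is executed here; the commands are only returned for review.
    """
    return [
        # Register the Dev repo as a second remote (the name "dev" is arbitrary).
        ["git", "remote", "add", "dev", dev_repo_url],
        ["git", "fetch", "dev"],
        # The two repos share no common ancestor, so git refuses a plain
        # merge unless unrelated histories are explicitly allowed.
        ["git", "merge", "--allow-unrelated-histories", f"dev/{branch}"],
    ]


if __name__ == "__main__":
    for cmd in sync_test_repo_commands("https://example.invalid/adf-dev.git"):
        print(" ".join(cmd))
```

Even in this simplified form, that is three extra commands per environment, plus any conflict resolution, before or after every deployment.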
At best, this Test repo becomes an additional place to store the code that is currently in Test. But if you are working through a healthy development process, you already have this code in your Dev repo. You might have even tagged it for release, so it’s easy to find. If your Git repo is cloud-based, it is likely already highly available. In my mind, this just creates one more copy of your code that can get out of date, and one more deployment step. If you just want a copy of what is in Test or Prod for safekeeping, you can always export the resource’s ARM template. But if I were to do that, I would be inclined to keep it in blob storage or somewhere outside of a repo, since I already have the code in a repo. This would allow me to redeploy if my repo weren’t available.
Then, once you have sufficiently tested your data factory in Test, would you deploy code to Prod from the Test repo or from the Dev repo?
Even if you have the discipline and DevOps/automation capabilities to support these multiple repos, you likely don’t want to do this unless you have requirements that mandate it. That brings me to my second reason.
Deviation from Common DevOps Practice
Having a repo per environment is a deviation from common software engineering practices. Most software engineering projects do not have separate repos per environment. They might have separate repos for different projects within a solution, but that is a different discussion.
If you have separate repos for Dev and Test, what do you do about history? I think there is also a danger that people would want to make changes in higher environments instead of working through the normal development process because it seems more expedient at the time.
When you hire new data engineers or DevOps engineers (whoever maintains and deploys your data factories), you would have to explain this multi-repo process, as it won’t be what they are expecting.
Unless you have some special requirements that dictate this need, I don’t see a good reason to deviate from common practice.
Common Development Process
For a data factory project, we must define a collaboration branch, usually Main. This branch is the only branch that can publish to the live service in your Dev data factory. When you need to update your data factory, you make a (hopefully short-lived) feature branch based off of your collaboration branch. My preference for a medium to large project is to have the Main branch, an Integration branch, and one or more feature branches. The Integration branch brings multiple features together for testing before the final push to Main. On smaller projects with one or two experienced developers, you may not need the integration branch. I find that I like the integration branch when I am working with people who are new to ADF, as it gives me a chance to tweak and execute new pipelines before they get to Main.
Developers work in the feature branches and then merge into the integration branch as they see fit. They resolve any errors and make any final changes in integration and then create a pull request to get their code into Main. Once the code is merged into Main and published to the live service (either manually or programmatically), the feature branches and Integration branch are deleted, preparing you to start the next round of development. Triggering the pipelines in the live service after publishing gives you a more realistic idea of execution times as ForEach activities may not run in parallel when executed in debug mode.
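The branch flow described above can be sketched as a toy model in Python. The branch names (`main`, `integration`, the `feature/` prefix) and the function names are my own assumptions for illustration. ADF itself enforces the rule that only the collaboration branch can publish; this code just writes the rules down.

```python
COLLABORATION = "main"          # the collaboration branch
INTEGRATION = "integration"     # optional branch for combining features
FEATURE_PREFIX = "feature/"     # naming convention assumed for this sketch


def can_publish(branch: str) -> bool:
    """Only the collaboration branch may publish to the live Dev service."""
    return branch == COLLABORATION


def merge_target(source: str, use_integration: bool = True) -> str:
    """Return the branch that `source` should be merged into.

    Raises ValueError for branches that are not part of the flow.
    """
    if source.startswith(FEATURE_PREFIX):
        # Small teams may skip integration and go straight to main.
        return INTEGRATION if use_integration else COLLABORATION
    if source == INTEGRATION and use_integration:
        return COLLABORATION  # this step happens through a pull request
    raise ValueError(f"{source!r} has no merge target in this flow")


if __name__ == "__main__":
    print(merge_target("feature/new-pipeline"))
    print(merge_target("integration"))
    print(can_publish("integration"))
```

Passing `use_integration=False` models the smaller-project case, where feature branches merge directly into the collaboration branch.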
The code in Main should represent a version of your data factory that is ready to be deployed to Test. Code is deployed from Dev to Test according to your preference—I won’t get into all the options of JSON files vs ARM templates and DevOps pipelines vs PowerShell/custom code in this post.
You perform unit testing, integration testing, and performance testing in your Test data factory (along with any other testing you need, although most people aren’t doing even these three sufficiently). If everything looks good, you deploy to Production. If there are issues, you go back to your development data factory, make a new feature branch from Main, and fix the issue.
If you find a bug in production, and you can’t use the current version of code in Main, you might want to create a hotfix/QFE. Your hotfix is still created in your development data factory. The difference is that instead of creating a feature branch from Main, you create the branch from the last commit deployed to production. If you are deploying via ARM templates, you can export the ARM template from that hotfix branch and manually check it in to the adf_publish branch. If you deploy from JSON files, selective deployment is a bit easier. I like to use ADF Tools for deployment, which allows me to specify which files should be deployed, so I can do a special hotfix deployment that doesn’t change any other objects that may have already been updated in Main in anticipation of the next deployment.
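As a rough sketch of the branching step for a hotfix, the Python below builds the Git commands to start a branch from the last commit deployed to production. It assumes you tagged that commit at release time; the tag `release-1.4`, the `hotfix/` prefix, and the function name are hypothetical. As before, it only returns the commands and doesn’t execute them.

```python
from typing import List


def hotfix_branch_commands(deployed_ref: str, hotfix_name: str) -> List[List[str]]:
    """Build git commands that start a hotfix branch from what is in production.

    `deployed_ref` is a tag or commit SHA marking the last production
    deployment (for example, a hypothetical release tag "release-1.4").
    """
    branch = f"hotfix/{hotfix_name}"
    return [
        # Make sure release tags from the remote are available locally.
        ["git", "fetch", "--tags", "origin"],
        # Branch from the deployed commit, not from the tip of main.
        ["git", "switch", "--create", branch, deployed_ref],
        # Push so the branch is visible in the repo attached to the Dev factory.
        ["git", "push", "--set-upstream", "origin", branch],
    ]


if __name__ == "__main__":
    for cmd in hotfix_branch_commands("release-1.4", "fix-copy-sink"):
        print(" ".join(cmd))
```

Because the branch starts at the deployed commit, it contains none of the changes already merged into Main for the next release.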
Having a repo per environment doesn’t technically break anything, but it adds complexity without significant benefits. It adds steps to your deployment process and deviates from industry standards. I won’t go so far as saying “never”, as I haven’t seen every project scenario. If you were considering going this route, I would encourage you to examine the reasons behind it and see if doing so would actually meet your needs and if your team can handle the added complexity.