A while back, a client who hosts user-facing databases in Azure SQL Database had a novel problem. One of their customers had all of their infrastructure in AWS, and wanted to be able to access my client's data from an RDS instance. There aren't many options for doing this: replication doesn't work with Azure SQL Database as a publisher because there's no SQL Agent. Managed Instance would have been messy from a network perspective, as well as cost prohibitive compared to Azure SQL DB serverless. An ETL tool like Azure Data Factory would have worked, but would have required a significant number of dev cycles to detect changed data. Enter Azure Data Sync.
If you’ve ever noticed this button on the Azure SQL Database blade in the portal:
and wondered what it does: Azure Data Sync works conceptually like transactional replication. If you go to the docs page for Azure Data Sync, you can see it works just like replication, except a little worse. The big advantage data sync has is that it does not rely on a distribution database, which means you can use an Azure SQL DB as a source for data sync, which you can't do with replication. Data sync uses a system of triggers and tracking tables to identify which rows it needs to send to the sync members. Typically an additional database is used for those objects; however, in my scenario I keep all of the data sync objects in my user database, because that database is only being used for syncing purposes.
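If you're curious what those triggers and tracking tables look like, you can inspect the objects Data Sync provisions in a database with a query like the one below. This is a diagnostic sketch: it assumes the objects live in the default `DataSync` schema and that the change-tracking triggers follow the `_dss_` naming pattern, which is what I've observed, but verify against your own environment.

```sql
-- List the tables, triggers, and procedures Data Sync created in this database.
-- Assumes the default DataSync schema name.
SELECT s.name AS schema_name, o.name AS object_name, o.type_desc
FROM sys.objects AS o
JOIN sys.schemas AS s ON s.schema_id = o.schema_id
WHERE s.name = N'DataSync'
ORDER BY o.type_desc, o.name;

-- The change-tracking triggers sit on the user tables themselves;
-- the LIKE pattern here is an assumption based on observed naming.
SELECT t.name AS trigger_name, OBJECT_NAME(t.parent_id) AS table_name
FROM sys.triggers AS t
WHERE t.name LIKE N'%_dss_%';
```

Running this before and after provisioning a sync group is a good way to see exactly what Data Sync adds to (and, after deprovisioning, should remove from) your database.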
As mentioned, my client's customer needed to receive data from Azure SQL DB into Amazon RDS. While this is very possible with data sync, it was challenging due to vague documentation. I'm not going to cover the networking in great detail, but the basic idea is that the VM your data sync agent runs on needs to be able to talk to your RDS instance on port 1433. There are numerous ways to make that happen, and that part is not the hard part to figure out.
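Once the networking is in place, a quick sanity check from the agent VM confirms the path to RDS is open before you start troubleshooting the agent itself. The host name below is a placeholder; substitute your own RDS endpoint.

```powershell
# From the data sync agent VM, confirm TCP connectivity to the RDS
# endpoint on 1433. The host name is a placeholder for illustration.
Test-NetConnection -ComputerName 'my-rds-instance.example.us-east-1.rds.amazonaws.com' -Port 1433
```

If `TcpTestSucceeded` comes back `False`, fix the network path (security groups, NSGs, routing) before touching the sync configuration.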
The sync agent gets installed on a VM. I ran into one major challenge with that VM: my database was about 200 GB, with a lot of compressed and columnstore data. The way data sync works, data is extracted onto the agent VM rather than streamed directly into the target database. Also, for whatever reason, that data seems to be extracted twice, which means you need to size the data drive on that VM accordingly. In my case, I had to increase the size of the drive to 1 TB to allow the process to complete. You will also want to use a premium managed disk, because the performance of that drive directly impacts the duration of the sync process.
My client's data sync scenario pushes data once a day rather than continuously. In order to save money, I used Azure Automation to power the agent VM on and off, and after powering down, to change the premium disk to standard to further reduce costs. The data sync process can be started via PowerShell; however, the process is asynchronous, which means you need a separate automation runbook to monitor the progress of that job (which is what I used to power down the VM when the process completes).
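The overall flow can be sketched in PowerShell along these lines. This is a simplified sketch, not my production runbooks: it assumes the `Az.Sql` and `Az.Compute` modules, all resource names are placeholders, and in practice the kickoff, monitoring, and power-down steps live in separate runbooks rather than one script.

```powershell
# Placeholder resource names -- substitute your own.
$rg   = 'my-rg'
$srv  = 'my-sqlserver'
$db   = 'my-database'
$sync = 'my-syncgroup'

# Kick off the sync. This returns immediately; the sync itself
# runs asynchronously on the service side.
Start-AzSqlSyncGroupSync -ResourceGroupName $rg -ServerName $srv `
    -DatabaseName $db -SyncGroupName $sync

# (In a separate runbook) poll the sync group until it is no longer
# progressing. Polling interval is arbitrary here.
do {
    Start-Sleep -Seconds 300
    $state = (Get-AzSqlSyncGroup -ResourceGroupName $rg -ServerName $srv `
        -DatabaseName $db -SyncGroupName $sync).SyncState
} while ($state -eq 'Progressing')

# Once the sync is done, deallocate the agent VM and drop the data
# disk back to standard storage to stop paying premium rates.
Stop-AzVM -ResourceGroupName $rg -Name 'my-agent-vm' -Force
$diskUpdate = New-AzDiskUpdateConfig -SkuName 'Standard_LRS'
Update-AzDisk -ResourceGroupName $rg -DiskName 'my-agent-datadisk' -DiskUpdate $diskUpdate
```

Note that changing a managed disk's SKU requires the VM to be deallocated, which is why the disk downgrade happens after `Stop-AzVM`; the startup runbook reverses these steps, upgrading the disk back to premium before powering the VM on.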