A Lesson in DR, Azure Site Recovery, and Troubleshooting

I need to blog more. Stupid being busy. Anyway, last week we were doing a small-scale test for a customer, and it didn’t work the way we were expecting, for one of the dumbest reasons I’ve ever seen. If you aren’t familiar with Azure Site Recovery (ASR), it provides disk-level replication for VMs, and allows you to bring on-premises VMs online in Azure, or in another Azure region if your VMs are in Azure already. It’s not an ideal solution for busy SQL Server VMs with extremely low recovery point objectives. However, if you need a simple DR solution for a group of VMs, and can sustain around 30 minutes of data loss, it is cheap and easy. The other benefit that ASR provides, similar to VMware’s Site Recovery Manager, is the ability to do a test recovery in a bubble environment.

Our test environment was as shown in the image below:

In our case, due to configuration differences between subscriptions, we had to have our test domain controller in a different virtual network, peered to our network. The DC was also in a different Azure region, which wasn’t a big deal because you can still peer across regions. I have the additional box around the test environment because it is not connected to the rest of the network.

You should note that for a real disaster recovery scenario, you are likely better off having a live domain controller in each region where you want to operate. However, when you are testing you cannot use a live domain controller, as objects and passwords could get out of sync. For this test we added a DC to our ASR config as noted.

The other caveat to doing testing is that you need a way to log in to your test environment. Because your test network(s) are not connected to your other networks, unless you create a new VPN gateway, you likely have no connection to the test bubble. I don’t recommend creating a new VPN, which leaves you a couple of options:

  1. Create a jump host with a public IP address and create an allow list for specific IPs to connect to it
  2. Use Azure Bastion, which provides an endpoint that allows you to RDP/SSH into your VMs from a browser in a secure fashion.

We decided to go with option 2, which led to some weird ramifications.

If you are testing in this scenario, you want to bring your domain controller online before failing over your SQL Server and app server(s). It’s not completely necessary, but it will remove some headaches and reboots down the road. After you fail over your domain controller, you need to seize the FSMO roles and make the domain controller a global catalog, if it is not one already. However, you need to log in with domain credentials to do all of that first.

There is no guidance in the docs about how to log in with a domain account using Bastion. However, after a bunch of stupid troubleshooting yesterday, I discovered that if you attempt to log in using the standard domain credentials (e.g. contoso\joey), you will see the following error message.

Instead, you have to log in with the user principal name (UPN), or in my case joey@contoso.com. We troubleshot this in a test environment yesterday and were confused, because DNS worked and most other AD functionality seemed to be in place. It took a couple more attempts at failover to work this out (one of the benefits of ASR is that it’s really easy to test failover multiple times).
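If you keep a list of test accounts in the contoso\joey style, a small helper can rewrite them into the UPN form that Bastion accepted for us. This is only a rough sketch in Python: the `to_upn` function and the `UPN_SUFFIXES` table are made up for illustration, and it assumes the UPN prefix matches the account name and that you already know which UPN suffix goes with each NetBIOS domain. Neither is guaranteed in a real forest, so check the userPrincipalName attribute in AD if in doubt.

```python
# Hypothetical mapping from NetBIOS domain name to UPN suffix.
# These values are assumptions for illustration; use the suffixes
# that are actually configured in your own forest.
UPN_SUFFIXES = {"contoso": "contoso.com"}


def to_upn(login: str, suffixes: dict[str, str] = UPN_SUFFIXES) -> str:
    """Convert DOMAIN\\user to user@suffix; leave UPN-style logins alone."""
    if "@" in login:
        # Already looks like a UPN, so pass it through unchanged.
        return login

    domain, sep, user = login.partition("\\")
    if not sep or not domain or not user:
        raise ValueError(f"not a DOMAIN\\user style login: {login!r}")

    try:
        # NetBIOS domain names are case-insensitive.
        suffix = suffixes[domain.lower()]
    except KeyError:
        raise ValueError(f"no UPN suffix known for domain {domain!r}") from None

    # Assumes the UPN prefix equals the account name (common, not guaranteed).
    return f"{user}@{suffix}"


print(to_upn(r"contoso\joey"))  # joey@contoso.com
```

The lookup table is deliberate: the UPN suffix cannot be derived reliably from the NetBIOS name alone, so the sketch fails loudly on an unknown domain instead of guessing.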

Sadly, this behavior wasn’t documented anywhere that I could find, except in a forum post (which was on docs.microsoft.com, so I guess that counts), but it would be really nice if the portal just showed the correct format for logging in with a domain account. Beyond that, ASR is pretty cool, and allows for easy DR testing.
