I’m writing this post because I’ve been mired in configuring a bunch of distributed availability groups for a client, and while the feature is technically solid, the lack of tooling can make it a challenge to implement. Specifically, I’m implementing these distributed AGs (please don’t use the term DAG as you’ll piss off Allan Hirt, but more importantly its used in Microsoft Exchange High Availability, so it’s taken) in Azure which adds a couple of additional changes because of the need for load balancers. You should note this feature is Enterprise Edition only, and is only available starting with SQL Server 2016.
First off why would you implement a distributed availability group? If you want to implement a disaster recovery (DR) strategy in addition to a high availability strategy with your AG. There’s limited benefit of implementing this architecture if you don’t have at least four nodes in your design. But consider the following design:
In this scenario, there are two data centers with four nodes. All of the servers are in a single Windows Server Failover Cluster. There are three streams from the transaction log on the primary which is called SQL1. This means we are consuming double the network bandwidth to send data to our secondary site in New York. With the distributed availability group, each location gets its own Windows Cluster and availability group, and we only send one transaction log stream across the WAN.
This benefits a few scenarios–the most obvious being, it’s a really easy way to do a SQL Server upgrade or migration. While Windows clustering now supports rolling OS upgrades, its much easier to do a distributed AG, because the clusters are independent of each other and have no impact on each other. The second is that its very easy to fail back and forth between these distributed availability groups. You have also reduced by half the amount of WAN bandwidth you need for your configuration, which can represent a major cost savings in a cloud world or even on-premises.
If you think this is cool, you with smart people–this is the technology Microsoft has implemented for geo-replication in Azure SQL Database. The architecture is really robust, and if you think about the tens of thousands of databases in Azure, you can imagine all of the bandwidth saved.
That’s Cool How Do I Start?
I really should have put this tl;dr at the start of this post. You’ll need this page at docs.microsoft.com. There’s no GUI. Which kind of sucks, because you can make typos in your T-SQL and the commands can still potentially validate and give you non-helpful error messages (ask me how I know). But in a short list here is what you do:
Create your first WSFC on your first two nodes
Create an Availability Group on your first WSFC, and create a listener. Add your database(s) to this AG
If you are in Azure, ensure your ILB has port 5022 (or whatever port you use for your AG endpoint) open
Create your second WSFC on the remaining two nodes
Create the second AG and listener, without a database. In case you really want to use the AG wizard, add a database to your AG, and then remove it. (Or quit being lazy and use T-SQL to create your AG)
Create the distributed AG on the first AG/WSFC
Add the second AG to your distributed Availability Group
This seems pretty trivial and when all of your network connections work (you need to be able to hit 1433 and 5022 from the listener’s IP address across both clusters). However, SQL Server has extremely limited documentation and management around this feature. The one troubleshooting hint I will provide is to always check the error log of the primary node of the second AG (this is known as the global forwarder), which is where you will see any errors. The most common error I’ve seen is:
A connection timeout has occurred while attempting to establish a connection to availability replica ‘dist_ag_00’ with id [508AF404-ED2F-0A82-1B8A-EA23BA0EA27B]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance
Sadly, that error is a bit of a catch all. In doing this work, I had a typo in my listener name on the secondary and SQL Server still processed the command. (So there’s no validation that everything and connect when you create the distributed AG). I’m sure in Azure this is all done via API calls, which means humans aren’t involved, but since there is no real GUI support for distributed AGs, you have to type code. So type carefully.
Overall, I think distributed availability groups are a nice solution for high available database servers, but without more tooling there won’t be broader adoption, and in turn, there won’t be more investment from Microsoft in tooling. So it’s a bit of a catch 22. Hopefully this post helps you understand this feature, where it might be used, and how to troubleshoot it.
Occasionally, we must resize or realign our Tempdb log file (.ldf) or data files (.mdf or .ndf) due to a growth event that forces the file size out of whack. To resize we have three options, restart the SQL Server Service, add additional files, or shrink the current file. We most likely have all been faced with run away log files and in an emergency situation restarting the SQL Services may not be an option but we still need to get our log file size smaller before we run out of disk space for example. The process of shrinking down that file can get tricky so I created this flow chart to help you out if you ever get into this situation.
Now its very important to note that many of these commands will clear your cache and will greatly impact your server performance as it warms cache backup. In addition, you should not shrink your database data or log file unless absolutely necessary. But doing so, it can result in a corrupt tempdb.
Let’s walk through it and explain somethings as we go along.
First thing you must do is issue a Checkpoint. A checkpoint marks the log as a “good up to here” point of reference. It lets the SQL Server Database Engine know it can start applying changes contained in the log during recovery after this point if an unexpected shutdown or crash occurs. Anything prior to the check point is what I like to call “Hardened”. This means all the dirty pages in memory have been written to disk, specifically to the .mdf and .ndf files. So, it is important to make that mark in the log before you proceed. Now, we know tempdb is not recovered during a restart it is recreated, however this is still a requirement.
Next, we try to shrink the log by issuing a DBCC SHRINKFILE command. This is the step that frees the unallocated space from the database file if there is any unallocated space available. You will note the Shrink? decision block in the diagram after this step. It is possible that there is no unallocated space and you will need to move further along the path to free some up and try again.
DBCC SHRINKFILE (templog, 1000); --Shrinks it to 1GB
If the database shrinks, great congratulations, however for some of us we still might have work to do. Next up is to try and free up some of that allocated space by running DBCC DROPCLEANBUFFERS and DBCC FREEPROCCACHE.
DBCC DROPCLEANBUFFERS – Clears the clean buffers from the buffer pool and columnstore object pool. This will flush cached indexes and data pages.
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
DBCC FREEPROCCACHE – Clears the procedure cache, you are probably familiar with as a performance tuning tool in development. It will clean out all your execution plans from cache which may free up some space in tempdb. As we know though, this will create a performance as your execution plans now have to make it back into cache on their next execution and will not get the benefit of plan reuse. Now it’s not really clear why this works, so I asked tempdb expert Pam Lahoud (B|T) for clarification as to why this has anything to do with tempdb. Both of us are diving into this to understand exactly why this works. I believe it to be related to Tempdb using cached objects and memory objects associated with stored procedures which can have latches and locks on them that need to be release by running this. Check back for further clarification, I’ll be updating this as I find out more.
DBCC FREEPROCCACHE WITH NO_INFOMSGS;
Once these two commands have been run and you have attempted to free up some space you can now try the DBCC SHRINKFILE command again. For most this should make the shrink possible and you will be good to go. Unfortunately, a few more of us may have to take a couple more steps through to get to that point.
The last two things I do when I have no other choice to get my log file smaller is to run those last two commands in the process. These should do the trick and get the log to shrink.
DBCC FREESESSIONCACHE– This command will flush any distributed query connection cache, meaning queries that are between two or more servers.
DBCC FREESESSIONCACHE WITH NO_INFOMSGS;
DBCC FREESYSTEMCACHE – This command will release all unused remaining cache entries from all cache stores including temp table cache. This covers any temp table or table variables remaining in cache that need to be released.
DBCC FREESYSTEMCACHE ('ALL');
In my early days as a database administrator I would have loved to have this diagram. Having some quick steps during stressful situations such as tempdb’s log file filling up on me would have been a huge help. So hopefully someone will find this handy and will be able to use it to take away a little of their stress.
Microsoft SQL Server has a much-maligned performance monitoring (perfmon) counter called Page Life Expectancy (PLE). What this counter monitors is how long SQL Server will expect to keep your data in memory.
Now it’s true that looking at this monitor right after the SQL Server has been restarted is kind of pointless. This is because when SQL Server starts up, the PLE should increase one value for every second that passes until the cache is completely full. This what we would expect it to do, as the PLE counter doesn’t know anything from before SQL Server was started. And since it has no historical information to work off of. So given that, it knows that SQL Server was started 10 seconds ago, so the PLE needs to be 10 seconds. One second after that, it’ll be 11 seconds, because again that’s all the information that the SQL Server instance has to work with. So when should we be looking at PLE?
The first time that we want to be monitoring PLE is to get a baseline for what we expect this counter should be reporting. If we don’t know what the baseline data is then we don’t know if a problem happens when the value changes.
When the value changes dramatically
The PLE value doesn’t have a target number that we watch for. What we are looking for with changes to PLE are changes to the baseline (see above). When PLE drastically changes is that drop normal, or is that something that is out of the ordinary? If PLE for the server is normally 5,000 and it drops to 65, is that drop normal or is it something that we need to look at?
One of the things we can do with PLE is look for trends. If the PLE drops every morning at 3am, then we need to see what’s happening at 3am that is causing the SQL Serer to flush the buffer pool each night at 3am.
NUMA Node Imbalance
PLE is tracked at the instance level, but this is just the result of a math formula. The actual raw PLE number is gathered at the NUMA node level. If we look at the NUMA node level PLE values and compare them to each other we can see potential imbalances between the NUMA nodes, and we can dive into those imbalances and see why SQL Server is doing that. Maybe we have a server configuration issue that needs to be addressed? Maybe we need to change the soft-NUMA settings for the server?
PLE Doesn’t Give Answers
As we’ve seen PLE doesn’t give answers. Instead, PLE points you in a direction and is one of the diagnostic tools that can be used to see that there is a problem that needs to be investigated and identified. PLE doesn’t specifically tell you what the problem is. It tells you that there is a problem and that the problem needs to be investigated.
This is clickbait post title, sorry. You are here now. The correct answer is that you should purchase MySQL as a database service from your favorite cloud provider (Google, Amazon, and Azure all offer prebuilt database as a service offerings) because they have gone through the trouble of making their solution highly available. I’ll speak to Azure, because I’m familiar with the platform–Microsoft doesn’t employ MySQL High Availability per se, however both the storage and the VM are highly available. If there is a disk failure, the service is not impacted, and if there is a failure in the compute tier, a new VM is provisioned.
My second recommendation, if you really, really want to build your own thing is to build a Windows Server Failover Cluster, and used shared storage. Make the MySQL service a clustered resource, and assign a floating IP to the service that will fail with it. (Yes, I know you have to pay M$ for the Windows Server licenses).
Why shouldn’t you use an open source solution to make your MySQL database highly available? First let’s look at a picture of a common MySQL high availability architecture:
If we think about what we need a clustering solution to provide it comes down to a few things:
Providing a floating IP address to allow connection to the primary
Check the health of the database services and initiate a failover in the event one of them isn’t healthy
Executing a clean database failover and providing the ability to easily fail back
Ensuring the stability of the overall cluster, maintaining quorum, and avoiding split brain scenarios
If you are using a shared storage scenario, the clustering solution needs to manage the shared storage to coordinate failover with services.
If you are using SQL Server with Windows Server Failover Clustering, the cluster service takes care of all of the above, and more. When you look to do this on Linux for MySQL that there about 10 different sets of components you can use to make the service highly available. At the basis of all of these solutions is MySQL replication it’s pretty trivial transactional replication. MySQL’s replication service is fairly robust, and the GTID implementation is pretty solid.
The problem is that the rest of the components are all mix and match. You could use Haproxy to float the IP address, but there’s no way to do a smart health check of the database. It simply does a port connection test. Which means, if your primary goes away, and then comes back without some advanced configuration your floating IP is going to fail back to the original primary whether it’s actually the primary in your replication pair. This is but one example–you are going to end up with 3 or 4 different components to execute each of these functions, and congratulations you are in charge of a complex distributed system that you are responsible for administering for the rest of your life.
But Joey, Facebook/Google/Pick You Other Favorite online megacorp run MySQL and they support it with 5 9s. Ok, sure, I don’t disagree with this–and as databases, MySQL and PostgreSQL are generally ok. But look around at your engineering staff–wait do you have engineering staff? If you don’t have a few people who have both really good Linux SA skills and DBA skills, you are going to be pretty quickly in situation where support is a challenge.
Finally, consider if you need an HA solution. Are you running on a virtual machine? As long as your infrastructure is solid, that probably gets you to about 99.5% availability on a bad week. What you absolutely want to avoid is the Windows 2000 paradigm, which is where your high availability solution incurs more downtime than a standalone system.
As Microsoft MVP’s and Partners as well as VMware experts, we are summoned by companies all over the world to fine-tune and problem-solve the most difficult architecture, infrastructure and network challenges.
And sometimes we’re asked to share what we did, at events like Microsoft’s PASS Summit 2015.
Necessary cookies are absolutely essential for the website to function properly. This category only includes cookies that ensures basic functionalities and security features of the website. These cookies do not store any personal information.
Any cookies that may not be particularly necessary for the website to function and is used specifically to collect user personal data via analytics, ads, other embedded contents are termed as non-necessary cookies. It is mandatory to procure user consent prior to running these cookies on your website.