Using Azure Automation to Shut Down a VM only if a SQL Agent Job is Not Running.

I have a client who uses MDS (Master Data Services) and SSIS (Integration Services) in an Azure VM. Since we only need to execute the SQL Agent job that runs the SSIS packages infrequently, we shut down the VM when it is not in use in order to save costs. We wanted to make sure that the Azure VM did not shut down when a specific SQL Agent job was still running, so I tackled this with some PowerShell runbooks in Azure Automation.

I split the job into two parts. The first runbook simply checks if a specified SQL Agent job is running and returns a text value that indicates whether it is running. A parent runbook checks if the VM is started. If the VM is started, it calls the child runbook to check if the job is running, and then shuts down the VM if the job is not running.

It’s fairly easy and convenient to have nested PowerShell runbooks in Azure Automation. There are two main ways to call a child runbook.

  1. Inline execution
  2. Using the Start-AzureRmAutomationRunbook cmdlet

It was less obvious to me how to call a child runbook when the parent runs in Azure and the child runs on a hybrid worker, especially when you need to use the output from the child runbook in the parent. A hyrid runbook worker allows us to access resources that are behind a VNET or on premises.

Travis Roberts has a nice video on just this topic that gave me the answers I needed.

Below is my parent runbook.

# Ensures you do not inherit an AzureRMContext in your runbook
Disable-AzureRmContextAutosave –Scope Process

$connection = Get-AutomationConnection -Name AzureRunAsConnection
Connect-AzureRmAccount -ServicePrincipal -Tenant $connection.TenantID `
-ApplicationID $connection.ApplicationID -CertificateThumbprint $connection.CertificateThumbprint

$rgName='MyResourceGroup'
$vmName='MyVM'
$SubID = 'XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX'

$AzureContext = Select-AzureRmSubscription -SubscriptionId $SubID
'Check if VM is on'
$vm=((Get-AzureRmVM -ResourceGroupName $rgName -AzureRmContext $AzureContext -Name $vmName -Status).Statuses[1]).Code
 $vm 
 if ($vm -eq 'PowerState/running')
 {
    Do 
    {
        #if VM running call other runbook
        start-sleep -Seconds 60;
        'Check if job is running'
        $JobRunning = start-azureRMautomationrunbook -AutomationAccount 'ProgramsAutomation' -Name 'CheckRunningSQLJob' -ResourceGroupName $rgName -AzureRMContext $AzureContext -Runon 'Backups' -Wait;
        Write-Output $JobRunning
        
        
    } Until ($JobRunning -eq 'run0')
    
    'Stopping VM'
    stop-azurermvm -Name $VMname -ResourceGroupName $RgName -force
}

The runbook sets the Azure context to the appropriate subscription (especially important when you are a guest user in someone else’s tenant). Then it checks if the VM is started. If it is, it goes into a do-while loop. This task isn’t super time sensitive (it’s just to save money when the VM isn’t in use), so it’s waiting 60 seconds and then calling the child runbook to find out if my SQL Agent job is running. This makes sure that the child runbook is called at least once. If the result is that the job is not running, it stops the VM. If the job is running, the loop starts over, waiting 60 seconds before checking again. This loop is essentially polling the job status until it sees that the job is completed. One thing to note is the -Wait parameter on the end of that Start-AzureRmAutomationRunbook command. If you don’t specify the -Wait parameter, the command will immediately return a job object. If you specify the -Wait parameter, it waits for that child job to complete and returns the results of that job.

And here is my child runbook.

[OutputType([string])]

$SQLJobName = 'MySQLAgentJobName'
$SQLInstanceName = 'MySQLServer

$cred=Get-AutomationPSCredential -Name 'mycredential'
 
$server = Connect-DbaInstance -SqlInstance $SQLInstanceName -SqlCredential $cred
 
Get-DbaRunningJob -SqlInstance $server | Get-DbaRunningJob

$JobStatus = (Get-DbaRunningJob -SqlInstance $server).Name -match $SQLJobName

If ($JobStatus -ne $false) 
{
#job is running. Passing back a string because bits and ints were causing issues.
    $JobRunning = 'run1'
    Write-Output $JobRunning 
}
else 
{ 
#job is idle
    $JobRunning = 'run0'
    Write-Output $JobRunning 
}

I’m using dbatools to check if the job is running on the server. That is the Get-DBARunning Job command. The important part to note is that you have to use the Write-Output command for this output to be available to the parent runbook. I got some weird results when I tried to return an int or a boolean (it was returning an object rather than a single value), so I just went with a string. The string, while not the most efficient, works just fine. If you understand why this is, feel free to leave me a comment.

These runbooks have been in place for a couple of months now, and they are working great to shut down the VM to save money while making sure not to disturb an important SQL Agent job that might occasionally run late. I didn’t find much documentation nor many examples of using output from a child job that runs on a hybrid worker, so I wanted to get this published to help others that go searching.


Contact the Author | Contact DCAC

Why We Don’t Truncate Dimensions and Facts During a Data Load

Every once in a while, I come across a data warehouse where the data load uses a full truncate and reload pattern to populate a fact or dimension. While it may not be the end of the world for a small table, it does concern me and I usually recommend to redesign the load. My thoughts below on why this is an anti-pattern are true for using the actual TRUNCATE TABLE statement as well as executing a DELETE statement with no WHERE clause.

Surrogate Keys

Dimensional models use surrogate keys rather than natural keys. Surrogate keys are system-generated, meaningless values that are usually integers used to uniquely identify a record. They provide good performance for joins in queries, allow us to switch or use multiple source systems to feed the same tables, and facilitate the use of slowly changing dimensions. If you truncate a dimension table and then repopulate it, you will end up with different surrogate keys assigned to your dimension values. Let’s say we have a Geography dimension that looks like the table below.

GeographyKeyCityStateProvinceCountryPostalCode
1DenverColoradoUnited States80205
2DallasTexasUnited States75201
3ParisÎle-de-FranceFrance75001

If I truncate and reload this table, there is no guarantee that Denver will have a GeographyKey value of 1. I might reload the data and then have Paris as 1, Denver as 2, and Dallas as 3. And since we use surrogate keys as dimension lookups in fact tables and bridge tables (and snowflaked dimensions, if you use those), I would now need to update every table that references this dimension. That’s a lot of table updates without a good reason.

Error Handling

Another reason to avoid truncating the tables in your dimensional model is error handling. When you design your data load, you need to think about what should happen when it fails and where it is most likely to fail. Failures will often occur either in data transformation steps or upon inserting/updating values in the destination table due to data type/size conflicts.

Let’s say you truncate your fact table and then you encounter an ETL failure while performing the transformations to reload the table. Now you have an empty fact table. If you have error handling logic in your ETL, you may be able to redirect the error rows to another location to be handled later. But that still leaves you with an incomplete table at best. If instead of truncating and reloading, you were doing inserts and updates when you encountered an error, your table might reflect slightly stale data as of the previous load. You still might have a situation where your table is partially updated with the rows that were inserted before the error was encountered depending on how your ETL design. But having a fact table with old or partially updated data is usually (but not always) more preferable than having no data when a data load process fails.

Performance and Data Availability

When you truncate and reload a table, you are assuming you have access to all of the source data needed to fully repopulate the table. This may be true today, but what about next year when you switch CRM systems? Or what if your organization makes the decision to archive data in the source system that is older than 5 years? If any data needed for your fact or dimension becomes unavailable, your truncate and reload pattern will fail to serve your needs.

Even if you are sure your data will be available, you may want to consider your data load times. If you only have a handful of tables with a few hundred thousand rows max, you may load your data warehouse in a few minutes. But what happens when you have a few hundred million rows with some complicated transformation logic in your pipelines? You might be adding minutes to hours to your load times.

Why Do ETL Developers End Up Truncating Fact Tables?

Sometimes developers just don’t know better. But often there were understandable reasons for using the truncate and reload pattern. While I have never built a system where I truncated a dimension, I have had a couple were we at least started out with truncating the fact table. We usually built it to make data initially available with correct values, and then worked with stakeholders to find a different way to access the data or to have them generate the data differently to alleviate the problems that caused us to want to truncate and reload.

Often the cause is data quality issues. One one project I worked, we had dimension values that defined the granularity of the fact table that would change in the source systems, and the process to try to update them in the fact table was too complicated and took too long. So we made the decision to truncate and reload, understanding the risks of doing so. Having the data available and usable for analysis gave us more information on how useful the data was to users and helped us work to understand why and how our keys were changing. Over time, we were able to influence the way data was entered into source systems so that we didn’t have to go through the truncate and reload process and could perform incremental loads on our fact table.

Sometimes source systems allow hard deletion of data and don’t maintain a list of what was deleted. And depending on access methods and the size of the data, it can be difficult to compare the rows in a fact table with the source data to identify the deleted rows. I would argue this is a poor design of the source system, but we often can’t change that, and we still need to include the data in our data warehouse. So it’s understandable why someone in this situation would want to truncate and reload a fact table.

If your stakeholders are ok with the potential of long load times, empty tables when errors occur, and the assumption that source data will always be available, then there isn’t much problem with truncating and reloading a fact table. But that is often not the case. And that decision should be made explicitly rather than implicitly.

What To Do Instead of Truncating and Reloading a Fact Table

If you’ve been truncating and reloading a fact table because that just seemed like the simplest thing to do, you can change your load pattern.

My general load pattern is:

  1. Truncate the update table
  2. Insert new rows into the fact table and changed rows into the update table
  3. Perform set-based updates on the fact table based upon the data in the update table

I’m able to determine what rows are new, changed, and unchanged by using hash values. I concatenate and hash the values that define the level of uniqueness of the row into one value I call HistoricalHash and the values from the remaining columns into a value I call ChangeHash. In other words, I load a transactional or periodic snapshot fact table in a manner similar to a Type 1 slowly changing dimension.

If you have data quality, data deletion, or other issues that prevent you from using a change detection pattern like the above, consider using a staging table and swapping it out with the fact table. Create a staging table that looks exactly like your fact table.

  1. Truncate the staging table
  2. Populate the staging table with all rows that should be in the fact table
  3. Swap the staging table and the fact table (this usually involves renaming the tables)

Understand the Consequences of Your Design

If you are choosing to truncate a dimension or fact table, be sure that you understand the trade-offs. You may be unwittingly opting for simplicity of ETL over data availability and efficient data loads.


Contact the Author | Contact DCAC

Understanding Columnstore Indexes in SQL Server Part 3

My last two blogs were on understanding columnstore and creating columnstore indexes. In this post, I will touch on one last topic, when to use columnstore indexes. As noted in my prior posts these indexes are designed for large data warehouse workloads, not normal OLTP workload tables. As with any index design it is important to know your data and what you will be using the data for. Be sure to look to what types of queries you will be running against it, as well as how the data is loaded and maintained. Let’s look at a few questions you should ask yourself before deciding to implement a columnstore index. Just because your table consist of millions of rows doesn’t mean columnstore is the right way to go.

First you need to know the data you are designing the index for.

Is your table large enough to benefit? As we reviewed in my first post usually this means in the millions of records range as these rows are divided into groups of rows, called a rowgroup.  A rowgroup has a minimum of 102,400 rows with a max of approximately 1 million rows.  Each rowgroup is changed into column segments. Therefore, having a columnstore index on a table with under a 1 million rows does not make sense in that if the table is too small you don’t get the benefits of compression that comes with the column segments. A general recommendation is to use columnstore indexes on the fact tables in your data warehouse, and on very large dimension tables, containing over 5 million rows.

Is your data volatile, meaning changing frequently? Rule of thumb says you want tables that rarely have data modifications, more specifically, where less then 10% of the rows are modified. Having large numbers of deletes can cause fragmentation, which adversely affect compression rates, thus reducing the efficiency of the index. Updates in particular are expensive, as they are processed as deletes followed by inserts, which will adversely affect the performance of your loading process.

What datatypes are in your table? There are several data types that are not supported within a columnstore index. Data types like varchar(max), nvarchar(max), or varbinary(max) were not supported until SQL Server 2017, and typically aren’t the best fit for this type of workload. Additionally, if you are using uniqueidentifiers (GUIDs) you won’t be able to create your index as they are still not supported.

Next what are you doing in your queries.

Are you doing aggregations or performing analytics on the data or are you looking for specific values? The standard B-tree rowstore indexes are best for singleton lookups (single values) and are sometimes used in tandem with a columnstore index. If you’re using an index to cover a where clause that does not look up a range of values and is just filling predicates, then columnstore does not benefit. This especially  true if you need to “cover” that query with included columns since columnstore does not allow included columns. However, columnstore is designed to quickly process aggressions especially on a grouped range of values. So, if you are performing aggregations or analytics usually columnstore can give you large performance gains as it can do full table scans to perform aggregations very fast.

Now there are times where you want to seek a specific value and perform aggregations (think average sale price in the last quarter for a specific product). In these cases you may benefit from a combination of a rowstore and columnstore index. Creating the columnstore to handle the grouping and aggressions and covering the index seek requirement with the rowstore. Adding these b-tree indexes will help your query performance, but they can dramatically impact the loading process—if your data loads are large enough it may be more efficient to drop the b-tree index, and rebuild it after loading the data into the columnstore.

Is this a data warehouse fact or dimension table? As we know a dimension table is usually used to find specific values, mostly lookups to match with an aggregated value from a fact table. If it’s a dimension table, typically you are going to use a b-tree based models, with the exception of very large dimensions. We tend to use columnstore on fact tables in the data warehouse as these tables are normally the source of aggregations.

Summary

Knowing your data is pivotal in deciding whether or not a columnstore index is best suited for your query needs. Don’t just jump to columnstore indexes now that you have an understanding of them, and your tables are large.  Make sure to take the time to choose the right indexing option for your usage patterns, the data and the overall environment.


Contact the Author | Contact DCAC

Understanding Columnstore Indexes in SQL Server Part 3

My last two blogs were on understanding columnstore and creating columnstore indexes. In this post, I will touch on one last topic, when to use columnstore indexes. As noted in my prior posts these indexes are designed for large data warehouse workloads, not normal OLTP workload tables. As with any index design it is important to know your data and what you will be using the data for. Be sure to look to what types of queries you will be running against it, as well as how the data is loaded and maintained. Let’s look at a few questions you should ask yourself before deciding to implement a columnstore index. Just because your table consist of millions of rows doesn’t mean columnstore is the right way to go.

First you need to know the data you are designing the index for.

Is your table large enough to benefit? As we reviewed in my first post usually this means in the millions of records range as these rows are divided into groups of rows, called a rowgroup.  A rowgroup has a minimum of 102,400 rows with a max of approximately 1 million rows.  Each rowgroup is changed into column segments. Therefore, having a columnstore index on a table with under a 1 million rows does not make sense in that if the table is too small you don’t get the benefits of compression that comes with the column segments. A general recommendation is to use columnstore indexes on the fact tables in your data warehouse, and on very large dimension tables, containing over 5 million rows.

Is your data volatile, meaning changing frequently? Rule of thumb says you want tables that rarely have data modifications, more specifically, where less then 10% of the rows are modified. Having large numbers of deletes can cause fragmentation, which adversely affect compression rates, thus reducing the efficiency of the index. Updates in particular are expensive, as they are processed as deletes followed by inserts, which will adversely affect the performance of your loading process.

What datatypes are in your table? There are several data types that are not supported within a columnstore index. Data types like varchar(max), nvarchar(max), or varbinary(max) were not supported until SQL Server 2017, and typically aren’t the best fit for this type of workload. Additionally, if you are using uniqueidentifiers (GUIDs) you won’t be able to create your index as they are still not supported.

Next what are you doing in your queries.

Are you doing aggregations or performing analytics on the data or are you looking for specific values? The standard B-tree rowstore indexes are best for singleton lookups (single values) and are sometimes used in tandem with a columnstore index. If you’re using an index to cover a where clause that does not look up a range of values and is just filling predicates, then columnstore does not benefit. This especially  true if you need to “cover” that query with included columns since columnstore does not allow included columns. However, columnstore is designed to quickly process aggressions especially on a grouped range of values. So, if you are performing aggregations or analytics usually columnstore can give you large performance gains as it can do full table scans to perform aggregations very fast.

Now there are times where you want to seek a specific value and perform aggregations (think average sale price in the last quarter for a specific product). In these cases you may benefit from a combination of a rowstore and columnstore index. Creating the columnstore to handle the grouping and aggressions and covering the index seek requirement with the rowstore. Adding these b-tree indexes will help your query performance, but they can dramatically impact the loading process—if your data loads are large enough it may be more efficient to drop the b-tree index, and rebuild it after loading the data into the columnstore.

Is this a data warehouse fact or dimension table? As we know a dimension table is usually used to find specific values, mostly lookups to match with an aggregated value from a fact table. If it’s a dimension table, typically you are going to use a b-tree based models, with the exception of very large dimensions. We tend to use columnstore on fact tables in the data warehouse as these tables are normally the source of aggregations.

Summary

Knowing your data is pivotal in deciding whether or not a columnstore index is best suited for your query needs. Don’t just jump to columnstore indexes now that you have an understanding of them, and your tables are large.  Make sure to take the time to choose the right indexing option for your usage patterns, the data and the overall environment.


Contact the Author | Contact DCAC
1 2 3 68

Video

Globally Recognized Expertise

As Microsoft MVP’s and Partners as well as VMware experts, we are summoned by companies all over the world to fine-tune and problem-solve the most difficult architecture, infrastructure and network challenges.

And sometimes we’re asked to share what we did, at events like Microsoft’s PASS Summit 2015.

Awards & Certifications

Microsoft Partner    Microsoft MVP    Microsoft Certified Master    VMWare Partner    VMWare vExpert
   Best Full-Service Cloud Technology Consulting Company    Insights Sccess Award    Technology Headlines Award    Golden Bridge Gold Award    CIO Review Top 20 Azure Solutions Providers