Why We Don’t Truncate Dimensions and Facts During a Data Load

Every once in a while, I come across a data warehouse where the data load uses a full truncate-and-reload pattern to populate a fact or dimension. While it may not be the end of the world for a small table, it does concern me, and I usually recommend redesigning the load. My thoughts below on why this is an anti-pattern apply to the actual TRUNCATE TABLE statement as well as to executing a DELETE statement with no WHERE clause.

Surrogate Keys

Dimensional models use surrogate keys rather than natural keys. Surrogate keys are system-generated, meaningless values that are usually integers used to uniquely identify a record. They provide good performance for joins in queries, allow us to switch or use multiple source systems to feed the same tables, and facilitate the use of slowly changing dimensions. If you truncate a dimension table and then repopulate it, you will end up with different surrogate keys assigned to your dimension values. Let’s say we have a Geography dimension that looks like the table below.

GeographyKey  City    StateProvince   Country        PostalCode
1             Denver  Colorado        United States  80205
2             Dallas  Texas           United States  75201
3             Paris   Île-de-France   France         75001

If I truncate and reload this table, there is no guarantee that Denver will have a GeographyKey value of 1. I might reload the data and then have Paris as 1, Denver as 2, and Dallas as 3. And since we use surrogate keys as dimension lookups in fact tables and bridge tables (and snowflaked dimensions, if you use those), I would now need to update every table that references this dimension. That’s a lot of table updates without a good reason.

Error Handling

Another reason to avoid truncating the tables in your dimensional model is error handling. When you design your data load, you need to think about what should happen when it fails and where it is most likely to fail. Failures will often occur either in data transformation steps or upon inserting/updating values in the destination table due to data type/size conflicts.

Let’s say you truncate your fact table and then encounter an ETL failure while performing the transformations to reload the table. Now you have an empty fact table. If you have error handling logic in your ETL, you may be able to redirect the error rows to another location to be handled later, but that still leaves you with an incomplete table at best. If you were doing inserts and updates instead of truncating and reloading when you encountered an error, your table might reflect slightly stale data as of the previous load. Depending on your ETL design, you might still end up with a table that is partially updated with the rows that were inserted before the error was encountered. But having a fact table with old or partially updated data is usually (though not always) preferable to having no data when a data load process fails.

Performance and Data Availability

When you truncate and reload a table, you are assuming you have access to all of the source data needed to fully repopulate the table. This may be true today, but what about next year when you switch CRM systems? Or what if your organization makes the decision to archive data in the source system that is older than 5 years? If any data needed for your fact or dimension becomes unavailable, your truncate and reload pattern will fail to serve your needs.

Even if you are sure your data will be available, you may want to consider your data load times. If you only have a handful of tables with a few hundred thousand rows max, you may load your data warehouse in a few minutes. But what happens when you have a few hundred million rows with some complicated transformation logic in your pipelines? You might be adding minutes to hours to your load times.

Why Do ETL Developers End Up Truncating Fact Tables?

Sometimes developers just don’t know better, but often there were understandable reasons for using the truncate-and-reload pattern. While I have never built a system where I truncated a dimension, I have had a couple where we at least started out by truncating the fact table. We usually built it that way to make correct data available quickly, and then worked with stakeholders to find a different way to access the data, or to have them generate the data differently, to alleviate the problems that caused us to want to truncate and reload.

Often the cause is data quality issues. On one project I worked on, we had dimension values that defined the granularity of the fact table that would change in the source systems, and the process of trying to update them in the fact table was too complicated and took too long. So we made the decision to truncate and reload, understanding the risks of doing so. Having the data available and usable for analysis gave us more information on how useful the data was to users and helped us understand why and how our keys were changing. Over time, we were able to influence the way data was entered into the source systems so that we didn’t have to go through the truncate-and-reload process and could perform incremental loads on our fact table.

Sometimes source systems allow hard deletion of data and don’t maintain a list of what was deleted. And depending on access methods and the size of the data, it can be difficult to compare the rows in a fact table with the source data to identify the deleted rows. I would argue this is a poor design of the source system, but we often can’t change that, and we still need to include the data in our data warehouse. So it’s understandable why someone in this situation would want to truncate and reload a fact table.

If your stakeholders are OK with potentially long load times, empty tables when errors occur, and the assumption that source data will always be available, then there isn’t much of a problem with truncating and reloading a fact table. But that is often not the case, and that decision should be made explicitly rather than implicitly.

What To Do Instead of Truncating and Reloading a Fact Table

If you’ve been truncating and reloading a fact table because that just seemed like the simplest thing to do, you can change your load pattern.

My general load pattern is:

  1. Truncate the update table
  2. Insert new rows into the fact table and changed rows into the update table
  3. Perform set-based updates on the fact table based upon the data in the update table

I’m able to determine what rows are new, changed, and unchanged by using hash values. I concatenate and hash the values that define the level of uniqueness of the row into one value I call HistoricalHash and the values from the remaining columns into a value I call ChangeHash. In other words, I load a transactional or periodic snapshot fact table in a manner similar to a Type 1 slowly changing dimension.
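
To make that concrete, here is a minimal sketch of the pattern, assuming hypothetical tables staging.FactSales (incoming rows), dbo.FactSales_Updates (the update table), and dbo.FactSales (the fact table), with the two hashes computed via HASHBYTES. Your column lists, hash inputs, and table names will differ.

-- Sketch only: hypothetical tables and columns, not a drop-in implementation.
-- Assume staging.FactSales already carries the two hashes, e.g.:
--   HASHBYTES('SHA2_256', CONCAT(OrderID, '|', OrderLineID))  AS HistoricalHash
--   HASHBYTES('SHA2_256', CONCAT(Quantity, '|', SalesAmount)) AS ChangeHash

-- Step 1: clear the update table.
TRUNCATE TABLE dbo.FactSales_Updates;

-- Step 2a: insert rows whose grain does not yet exist in the fact table.
INSERT INTO dbo.FactSales (OrderID, OrderLineID, Quantity, SalesAmount, HistoricalHash, ChangeHash)
SELECT s.OrderID, s.OrderLineID, s.Quantity, s.SalesAmount, s.HistoricalHash, s.ChangeHash
FROM staging.FactSales AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.FactSales AS f WHERE f.HistoricalHash = s.HistoricalHash);

-- Step 2b: land changed rows (same grain, different values) in the update table.
INSERT INTO dbo.FactSales_Updates (OrderID, OrderLineID, Quantity, SalesAmount, HistoricalHash, ChangeHash)
SELECT s.OrderID, s.OrderLineID, s.Quantity, s.SalesAmount, s.HistoricalHash, s.ChangeHash
FROM staging.FactSales AS s
INNER JOIN dbo.FactSales AS f
    ON f.HistoricalHash = s.HistoricalHash
WHERE f.ChangeHash <> s.ChangeHash;

-- Step 3: set-based update of the fact table from the update table.
UPDATE f
SET f.Quantity    = u.Quantity,
    f.SalesAmount = u.SalesAmount,
    f.ChangeHash  = u.ChangeHash
FROM dbo.FactSales AS f
INNER JOIN dbo.FactSales_Updates AS u
    ON u.HistoricalHash = f.HistoricalHash;

Unchanged rows (matching HistoricalHash and ChangeHash) are simply left alone, which is what keeps the load incremental.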

If you have data quality, data deletion, or other issues that prevent you from using a change detection pattern like the one above, consider using a staging table and swapping it out with the fact table, as sketched after the steps below. Create a staging table that looks exactly like your fact table.

  1. Truncate the staging table
  2. Populate the staging table with all rows that should be in the fact table
  3. Swap the staging table and the fact table (this usually involves renaming the tables)
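
Here is a minimal sketch of the swap step, assuming a hypothetical dbo.FactSales_Staging table that exactly mirrors dbo.FactSales and has just been fully reloaded. On Azure SQL DW you would use RENAME OBJECT instead of sp_rename.

-- Sketch only: hypothetical table names; the staging table must match the fact table exactly.
BEGIN TRANSACTION;

    -- Move the current fact table out of the way.
    EXEC sp_rename 'dbo.FactSales', 'FactSales_OLD';

    -- Promote the freshly loaded staging table to be the fact table.
    EXEC sp_rename 'dbo.FactSales_Staging', 'FactSales';

    -- The old fact table becomes the staging table for the next load (step 1 truncates it).
    EXEC sp_rename 'dbo.FactSales_OLD', 'FactSales_Staging';

COMMIT TRANSACTION;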

Understand the Consequences of Your Design

If you are choosing to truncate a dimension or fact table, be sure that you understand the trade-offs. You may be unwittingly opting for simplicity of ETL over data availability and efficient data loads.


What You Need to Know About Data Classifications in Azure SQL Data Warehouse

Data classifications in Azure SQL DW entered public preview in March 2019. They allow you to label columns in your data warehouse with their information type and sensitivity level. There are built-in classifications, but you can also add custom classifications. This could be an important feature for auditing your storage and use of sensitive data as well as compliance with data regulations such as GDPR. You can export a report of all labeled columns, and you can see who is querying sensitive columns in your audit logs. The Azure Portal will even recommend classifications based upon your column names and data types. You can add the recommended classifications with a simple click of a button.

You can add data classifications in the Azure Portal or via T-SQL or PowerShell. Data classifications are database objects.

ADD SENSITIVITY CLASSIFICATION TO
    dbo.DimCustomer.Phone
    WITH (LABEL='Confidential', INFORMATION_TYPE='Contact Info')

To view existing data classifications, you can query the sys.sensitivity_classifications view or look in the Azure Portal.

SELECT
    sys.all_objects.name AS [TableName],
    sys.all_columns.name AS [ColumnName],
    [Label],
    [Information_Type]
FROM sys.sensitivity_classifications
LEFT JOIN sys.all_objects
    ON sys.sensitivity_classifications.major_id = sys.all_objects.object_id
LEFT JOIN sys.all_columns
    ON sys.sensitivity_classifications.major_id = sys.all_columns.object_id
    AND sys.sensitivity_classifications.minor_id = sys.all_columns.column_id;

Be Careful When Loading With CTAS and Rename

One issue that is specific to using data classifications in Azure SQL DW is that it is possible to inadvertently drop your classifications when you are loading your tables using the recommended T-SQL load pattern. Typically, when using T-SQL to load a dimensional model in Azure SQL DW, we perform the following steps, sketched in T-SQL below:

  1. Create an upsert table via CTAS with the results of a union of new data from a staging table with existing data from the dimension table
  2. Rename the dimension table to something like Dimension_OLD
  3. Rename the upsert table to Dimension
  4. Drop the Dimension_OLD table
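
Sketched out in T-SQL, that pattern looks something like the following. DimCustomer, its columns, and the Staging schema are hypothetical, and the distribution and index options are only examples.

-- Sketch only: hypothetical dimension and staging tables.

-- 1. CTAS an upsert table: new rows from staging plus existing rows that are not being replaced.
CREATE TABLE dbo.DimCustomer_Upsert
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX)
AS
SELECT s.CustomerKey, s.CustomerName, s.Phone
FROM Staging.DimCustomer AS s
UNION ALL
SELECT d.CustomerKey, d.CustomerName, d.Phone
FROM dbo.DimCustomer AS d
WHERE NOT EXISTS (SELECT 1 FROM Staging.DimCustomer AS s WHERE s.CustomerKey = d.CustomerKey);

-- 2. Rename the current dimension table out of the way.
RENAME OBJECT dbo.DimCustomer TO DimCustomer_OLD;

-- 3. Rename the upsert table to take its place.
RENAME OBJECT dbo.DimCustomer_Upsert TO DimCustomer;

-- 4. Drop the old table, and with it any sensitivity classifications defined on its columns.
DROP TABLE dbo.DimCustomer_OLD;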

[Animation: the table load process in Azure SQL DW, shown first without and then with sensitivity classifications on the dimension table]

In the animation above, you’ll first see the load process as described, and then it will replay with sensitivity labels added to the dimension table. You’ll see that they are dropped when we drop the old dimension table. This makes sense because sensitivity classifications are objects related to that table. We would expect an index to be dropped when we drop the related table. This works the same way.

Check out my SQL notebook for a demonstration of the issue as well as my workaround that I describe below. If you spin up an Azure SQL Data Warehouse with the sample database, you can run this notebook from Azure Data Studio and see the results for yourself.

There are a few complicating factors:

  • There are currently no visual indicators of sensitivity classifications in SSMS or Azure Data Studio.
  • ETL developers may not have access to the data warehouse in the Azure Portal to see the sensitivity classifications there.
  • The entire process of adding and managing sensitivity classifications may be invisible to an ETL developer. A data modeler or business analyst might be the person adding and managing the sensitivity classifications. If the ETL developer isn’t aware classifications have been added, they won’t know to go and look for them in the sys.sensitivity_classifications view.
  • SSDT does not yet support sensitivity classifications. The only way I have found to add them into the database project is as a post-deployment script with the build property set to none.

The good news is that you can add the sensitivity classifications back to your dimension table using T-SQL. The bad news is that the ETL developer still must remember to do it. My workaround for now is a stored procedure that does the rename and drop of the tables and copies the sensitivity classifications over. My hope is that it’s easier to remember to use it since it does the rename and drop for you as well.
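
My stored procedure isn’t reproduced here, but the core idea is straightforward to sketch: after the renames and before dropping the old table, script out its classifications and apply them to the new table. Something along these lines, where dbo.DimCustomer and dbo.DimCustomer_OLD are hypothetical names:

-- Sketch only: generate ADD SENSITIVITY CLASSIFICATION statements from the old table's
-- classifications so they can be run against the newly renamed dimension table.
SELECT 'ADD SENSITIVITY CLASSIFICATION TO dbo.DimCustomer.' + c.name
     + ' WITH (LABEL=''' + sc.label + ''', INFORMATION_TYPE=''' + sc.information_type + ''');'
     AS AddClassificationStatement
FROM sys.sensitivity_classifications AS sc
INNER JOIN sys.columns AS c
    ON sc.major_id = c.object_id
    AND sc.minor_id = c.column_id
WHERE sc.major_id = OBJECT_ID('dbo.DimCustomer_OLD');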

Eventually, the tools will be updated to provide more visibility to data sensitivity classifications, but we still need to make sure they don’t get dropped.

For now, my recommendation is that if you are going to add a lot of sensitivity classifications, you create a user-defined restore point immediately afterward so that you know you have them in a backup somewhere. Azure SQL DW doesn’t do point-in-time restores the way Azure SQL DB does; it takes automatic restore points every 8 hours or so. So if someone went to the trouble of adding the sensitivity classifications and they were dropped by the data load process, there is no guarantee that you could use a backup to get them back.

Vote for My Enhancement Idea

If you would like Microsoft to add something to the product to keep sensitivity classifications from being dropped, or at least make it easier to add them back, please vote for my idea.

Not an Issue with Other Data Load Methods

Please note that if you are using other tools or methods to load your tables where you don’t swap them out, you won’t have the issue of dropping your sensitivity classifications. But I wanted to bring up this issue because I can see people spending a lot of time adding them and then suddenly losing them, and I want everyone to avoid that frustration.

Give Data Classifications a Try

I think data classifications are a good addition to SQL DW. Anything that helps us efficiently catalog and manage our sensitive data is good. I have added them in my demo environment and hope to use them in a client environment soon.

Have you tried out data classifications in SQL DW or DB? What do you think so far? If not, what is keeping you from using them?


DBCC Clonedatabase and Query Store

Cloning things is all the rage these days… even Storm Troopers.

Anyway, by now you have probably seen documentation on Query Store and DBCC Clonedatabase. Query Store, introduced in SQL Server 2016, is considered the “black box” or “flight recorder” for queries. It retains a history of executed queries, including run-time statistics as well as execution plans, from Query Store-enabled databases. This flight recorder helps to diagnose and pinpoint query performance issues.

Another useful tool is DBCC Clonedatabase. This command was first released in older versions via service packs before being fully released in SQL Server 2016. It allows you to make a “schema-only” copy of a database, which will reside on the same server as the source database. The schema includes both user and system schema, as well as any corresponding statistical data.

Why is this useful?

Let’s say you have a 1 terabyte database and want to perform query tuning on it. Storage is usually a premium commodity in most organizations, and there isn’t a spare terabyte of storage just lying around. DBCC Clonedatabase allows you to make a schema-only copy, including statistics. Once the new database has been created, you can move it onto another server without needing a large amount of storage. Since the database is really a schema-only copy, its footprint is quite small. After moving the database, queries run against it will use the statistics contained within it to build execution plans. True, there isn’t any data in the database, but you can account for that when performing query performance analysis.

DBCC CLONEDATABASE (source_database_name, target_database_name)
    [ WITH { [ NO_STATISTICS ] [ , NO_QUERYSTORE ] [ , VERIFY_CLONEDB ] [ , BACKUP_CLONEDB ] } ]

NO_STATISTICS – This option specifies that table/index statistics are excluded. Available with SQL Server 2014 SP2 CU3 & SQL Server 2016 SP1
NO_QUERYSTORE – This option specifies that query store data is excluded.  Available with SQL Server 2016 SP1
VERIFY_CLONEDB – Verifies the consistency of the new cloned database.  Available starting with SQL Server 2016 SP2
BACKUP_CLONEDB – Creates and verifies a backup of the newly cloned database.  Available starting with SQL Server 2016 SP2

The command works in this order:

  1. Creates a new destination database using the same file layout as the source database, but with the default file sizes from model.
  2. Generates a snapshot of the source database.
  3. Copies the system metadata from the source to the target database created in step 1.
  4. All object schema is copied from the source to the target.
  5. Index statistics are copied from the source to the target.

Fairly easy and straightforward.  But wait! There’s more!

You will notice the NO_QUERYSTORE option in the command. If you are running SQL Server 2016 SP1 or higher and utilizing Query Store for the source database, DBCC Clonedatabase will also bring over the Query Store data! This is enabled by default, so if you do not want the Query Store data, you have to explicitly specify NO_QUERYSTORE. This means that, by default, if you clone a database you will get the flight recorder data from Query Store as well.
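
For example, with a hypothetical database name:

-- Clone including Query Store data (the default on SQL Server 2016 SP1 and higher):
DBCC CLONEDATABASE (AdventureWorks2016, AdventureWorks2016_Clone);

-- Clone without statistics or Query Store data:
DBCC CLONEDATABASE (AdventureWorks2016, AdventureWorks2016_CloneEmpty)
    WITH NO_STATISTICS, NO_QUERYSTORE;

-- Clone and verify the consistency of the new clone (SQL Server 2016 SP2 and higher):
DBCC CLONEDATABASE (AdventureWorks2016, AdventureWorks2016_CloneVerified)
    WITH VERIFY_CLONEDB;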

These two tools, Query Store and DBCC Clonedatabase, are just another means to help troubleshoot performance issues.

© 2018, John Morehouse. All rights reserved.


Using SWITCH On A Single Partition

Several months ago, I was looking at a question posted on ask.sqlservercentral.com about how to move a large amount of data around quickly. The answer involved an aspect of table partitioning that I was not aware of. If you aren’t familiar with partitions in SQL Server, you can “divide” up a table into different segments or partitions. This is often done with large tables (think millions or billions of rows) in order to move data around quickly and efficiently. Switching partitions is a metadata-only operation, which is what makes it so efficient.

From Books Online (BOL):

The data of partitioned tables and indexes is divided into units that can be spread across more than one filegroup in a database. The data is partitioned horizontally, so that groups of rows are mapped into individual partitions. All partitions of a single index or table must reside in the same database. The table or index is treated as a single logical entity when queries or updates are performed on the data.

In SQL Server 2012 and later, you can have up to 15,000 partitions per partitioned table or index, as long as you are running the 64-bit version of SQL Server. If you happen to be running the 32-bit version of SQL Server, you are limited to 1,000 partitions.

Note: if you are still running a 32-bit version of SQL Server, you really need to move to 64-bit. You’ll do yourself and your organization a huge favor if you do.

Single Partition SWITCH

In a nutshell, you can use the SWITCH clause of ALTER TABLE to quickly move a table, which is itself a single partition, and all of its data to a new table or schema.

You can see partitions by doing a SELECT against sys.partitions. Using AdventureWorks2014, we can see below that the Person.Address table has a clustered index (index_id = 1) and 4 nonclustered indexes on it. Every index has at least one partition, with a partition_number of one. If any object has multiple partitions, you’d see it there with a corresponding partition_number greater than one.
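
Here is one way to run that check, assuming the AdventureWorks2014 sample database:

USE AdventureWorks2014;
GO

-- List the partitions for each index on Person.Address.
SELECT o.name AS TableName,
       p.index_id,
       p.partition_number,
       p.rows
FROM sys.partitions AS p
INNER JOIN sys.objects AS o
    ON p.object_id = o.object_id
WHERE o.name = 'Address'
    AND SCHEMA_NAME(o.schema_id) = 'Person'
ORDER BY p.index_id, p.partition_number;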


Of course, as with anything, there are limitations. The list below is not exhaustive, so be sure to read the documentation thoroughly.

  • Both tables must exist before the SWITCH operation
  • The target partition must be empty
  • Tables must have the same column structure (including nullability) and order
  • The index structure(s) must be identical between the source and target tables
  • Any foreign key constraints must be identical and cannot be marked as not trusted

Let’s try out a demo!

The Fun Stuff

For simplicity, I will:

  1. Create a simple table and load 1 million records into it.
  2. Create a secondary empty table that is identical in structure to the first table.
    1. This table will be in a different schema, the Archive schema
  3. Use the SWITCH command to switch the partition from the first table into the second table.
-- create the archive schema that will hold the target table
CREATE SCHEMA archive;
GO
-- create the source table
CREATE TABLE dbo.Switcharoo
( id INT);
GO
-- create the identical target table in the archive schema
CREATE TABLE archive.Switcharoo
( id INT);
GO

Now that the table structures are created, I’ll load the source table, dbo.Switcharoo, with 1 million records.  I’m using a numbers table courtesy of this blog post, slightly modified for the tables that I’m using.  If you are not familiar with a numbers table, it is a very quick and efficient way to generate a large set of sequential numbers.  Using a numbers table, I was able to insert 1 million records into the source table in less than 3 seconds.  Your mileage may vary depending on your hardware, but it will be faster than an iterative solution.

;WITH
 L0 AS(SELECT 1 AS C UNION ALL SELECT 1 AS C), -- 2 rows
 L1 AS(SELECT 1 AS C FROM L0 AS A CROSS JOIN L0 AS B), -- 4 rows
 L2 AS(SELECT 1 AS C FROM L1 AS A CROSS JOIN L1 AS B), -- 16 rows
 L3 AS(SELECT 1 AS C FROM L2 AS A CROSS JOIN L2 AS B), -- 256 rows
 L4 AS(SELECT 1 AS C FROM L3 AS A CROSS JOIN L3 AS B), -- 65,536 rows
 L5 AS(SELECT 1 AS C FROM L4 AS A CROSS JOIN L4 AS B), -- 4,294,967,296 rows
 Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS N FROM L5)
 
 INSERT INTO dbo.Switcharoo (id) SELECT TOP 1000000 N FROM Nums ORDER BY N;

We can see below that the source table now has rows. We can also see that the target table, archive.Switcharoo has zero rows.
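
A quick way to confirm that is a simple row count against both tables; the same query can be rerun after the switch:

-- Row counts before (and after) the switch.
SELECT 'dbo.Switcharoo' AS TableName, COUNT(*) AS TotalRows FROM dbo.Switcharoo
UNION ALL
SELECT 'archive.Switcharoo', COUNT(*) FROM archive.Switcharoo;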

Now that we have a good amount of data to switch, we can actually switch the partition from dbo.Switcharoo into archive.Switcharoo.   This is completed with a simple ALTER TABLE command.

ALTER TABLE dbo.Switcharoo SWITCH TO archive.Switcharoo

Because the switch is just a metadata operation, it’s very quick: less than a second to move all 1 million records from one table to the other. Checking the row counts again, the archive.Switcharoo table now has 1 million records and the dbo.Switcharoo table has zero.

Summary

Each individual table has at least one partition.  If you have to move a large amount of data around, even with just a single partition, switching partitions might be a quick, easy, and efficient solution.  A single line of code might just save you a lot of time and effort.  Make sure to read the documentation thoroughly.

If you want to learn more about partitioning (with multiple partitions), Cathrine Wilhelmsen (B|T) has a good series on partitioning. You can read it here.

Enjoy!

© 2018, John Morehouse. All rights reserved.
