Last week, Kerry and I delivered a webinar with tips on how to set up your Data Factory. We discussed version control, deployment, naming conventions, parameterization, documentation, and more.
Here’s our agenda from the presentation.
If you missed the webinar, you can watch it online now. Just go to the DCAC website, fill in the required fields with your info, and the video will be shown.
If you’d like a copy of the slides, you can download the PDF here. There is a list of helpful links at the end that you may want to check out.
I hope you enjoyed our webinar. Leave me a comment if you have other experiences with ADF where a design or configuration choice you made in the beginning was difficult or tedious to fix later. Help other ADF developers avoid those mistakes.
As I was browsing Twitter today, I noticed a tweet from the Colorado Department of Transportation about their anti-DUI campaign. Shown below, it contains a bar chart that appears to have been presented in PowerPoint.
There are some easy opportunities to improve the readability of this chart, so I thought I would use it as an example of how small improvements can have a big impact on a fairly simple chart. I recreated the chart (as best I could) in Power BI and then made two revised versions.
Especially when making data visualizations for the general public —and especially when you want to get people’s engagement on social media— you need to reduce perceived cognitive load. Otherwise, people won’t even bother to read your chart. If your chart feels too busy or too complicated, many people in your intended audience will feel it is not worth the effort to even try to read it and will move on down their Twitter feed to the next Anakin and Padme meme.
Each resource type in Azure has a naming scope within which the resource name must be unique. For PaaS resources such as Azure SQL Server (server for Azure SQL DB) and Azure Data Factory, the name must be globally unique within the resource type. This means that you can’t have two data factories with the same name, but you can have a data factory and a SQL server with the same name. Virtual machine names must be unique within the resource group. Azure Storage accounts must be globally unique. Azure SQL Databases should be unique within the server.
Since Azure allows you to create a data factory and a SQL server with the same resource name, you may think this is fine. But you may want to avoid this, especially if you plan on using system-defined managed identities or using Azure PowerShell/CLI. And if you aren’t planning on using these things, you might want to reconsider.
I ran into this issue of resources with the same name in a client environment and then recreated it in my Azure subscription to better understand it.
I already had a data factory named adf-deploydemo-dev so I made an Azure SQL server named adf-deploydemo-dev and added a database with the same name.
Azure Data Factory should automatically create its system-assigned managed identity. It will use the resource name for the name of the service principal. When you go to create a linked service in Azure Data Factory Studio and choose to use Managed Identity as the authentication method, you will see the name and object ID of the managed identity.
For the Azure SQL Server, we can create a managed identity using PowerShell. The Set-AzSqlServer cmdlet has an -AssignIdentity parameter, which creates the system-assigned managed identity.
If you use Get-AzSqlServer to retrieve the information and assign the Identity property to a variable, you can then see the system-assigned managed identity and its application ID.
Now when I look in Active Directory, I can see both managed identities have the same name but different application IDs and object IDs.
Everything is technically working right now, but I have introduced some needless ambiguity that can cause misunderstandings and issues.
Let’s say that I want to grant the Storage Blob Data Reader role to my data factory. I go to the storage account, choose to add a role assignment, select the role, and then go to add members. This is what I see:
Or let’s say that I use PowerShell to get lists of resources by name. I may be locating resources to add tags, add a resource lock, or move the resource to another region or resource group.
If I don’t specify the resource type, I will get my data factory, my database, and my server in the results. You may be saying “Well, I would always specify the type.” Even if that is true, are you sure all coworkers and consultants touching your Azure resources would do the same?
Why introduce this ambiguity when there is no need to do so?
There are some good tips in the Cloud Adoption Framework in Microsoft Docs about naming conventions. Your organization probably wants to decide up front what names are acceptable and then use Azure Policy as well as good processes to ensure adherence to your defined conventions. And if I were the consultant advising you, I would suggest that resources within your tenant be unique across resource types. The suggestion in Docs is to use a resource type abbreviation at the beginning of your resource name. That would avoid the issue I have demonstrated above. Naming conventions should be adjusted to your organization’s needs, but the ones suggested in Docs are a good place to start if you need some help. It is beneficial to have some kind of resource naming convention beyond just whatever is allowed by Azure.
In week 26 of Workout Wednesday for Power BI, I asked people to calculate the age of Nobel laureates at the time they received the award. I provided some logic, but I didn’t prescribe how to create the age calculation. This inspired a couple of questions and a round of data validation as calculating age may be trickier than you think. In this post, I’ll explore some of the ways people have calculated age in Power BI and the edge cases where those calculations may not work.
In my solution video for Workout Wednesday, I used Power Query to calculate age. This was inspired by several blog posts and videos I had seen previously. There is an Age menu option in the Power Query editor under Date.
When you select a date column and use that Age option, it calculates the duration between the selected date and the current date in days. You must then replace the current date with the second date column. Next you can choose Total Years under Duration, which divides the days by 365. Finally, you must round that number down to the next integer to get years.
If you follow Ruth’s video, you can do all of that in one step that creates a custom column with the final age value.
Age Quotient (DAX):
Age Quotient =
VAR Birthdate = [Date1]
VAR ThisDay = [Date2]
VAR IntBirthdate = YEAR ( Birthdate ) * 10000 + MONTH ( Birthdate ) * 100 + DAY ( Birthdate )
VAR IntThisDay = YEAR ( ThisDay ) * 10000 + MONTH ( ThisDay ) * 100 + DAY ( ThisDay )
VAR Age = QUOTIENT ( IntThisDay - IntBirthdate, 10000 )
VAR CheckedAge = DIVIDE ( Age, NOT ISBLANK ( Birthdate ) )
As Marco points out, many people were using YEARFRAC, but there is a bug in the DAX implementation that causes it to occasionally return an incorrect answer for this purpose.
Checking the Numbers
I created a Power BI file to demonstrate the differences in these four calculations. You can download the file here. The image below displays the results in several tests. For each row, I’m using Date1 as the birthdate and Date2 as the “as of” date. You’ll notice that I focused on leap years for a few cases.
There are six of ten date ranges that have different results across the different calculation methods.
In the second row, the Power Query age calculation says that Feb 29 to Feb 28 in the following year is a full year. This may or may not be what you want depending on your requirements. I’m noting the difference so you can be aware. A similar thing occurs in the fifth row going from Feb 29, 2016 to Feb 28, 2020, and again on the 9th row going from March 1, 2019 to Feb 29, 2020.
On the third row, notice that the DAX DATEDIFF function calculates Feb 29 to Feb 27 of the following year to be a full year, despite it being a day or two short. Depending on what you do with leap years, you might consider Feb 29 to Feb 28 in the following year to be a full year, but that third row result means DATEDIFF is probably not the calculation I want. We see a similar result going from March 1 to Feb 28 of the following year.
YEARFRAC calculates that Feb 29 to Feb 28 in the following year is not a full year, which may be desirable. But it counts Feb 29, 2016 to Feb 29, 2020 as only three years. And we see that March 1, 2000 to March 1, 2021 is only counted as 20 years. So even without starting on a leap year, we get some incorrect results. Small numbers seem to be correct until it gets to about 13 years.
Using the QUOTIENT function provides what I consider to be the most correct results. It calculates Feb 29 to Feb 28 of the following year to be less than a year. It calculates Feb 29, 2016 to Feb 28, 2016 to be three years and not four. And it calculates March 1 to Feb 29 of the following year to be less than a year.
Which to use?
The QUOTIENT formula produces the most accurate results if you don’t want Feb 29 to Feb 28 the next year to be counted as a year. DATEDIFF and YEARFRAC produce too many incorrect results for me to ever suggest using them. Since there is a DAX option that produces more correct answers, I would just go for QUOTIENT instead of either of these two.
UPDATE: There is a better alternative! Imke Feldmann reminded me that there is an Number.IntegerDivide function in Power Query. So let’s take the logic from Marco’s DAX calculation and move it to Power Query:
(BirthDate as date, EndDate as date) => let BirthDateInt = Date.Year(BirthDate)10000 + Date.Month(BirthDate)100 + Date.Day(BirthDate), EndDateInt = Date.Year(EndDate)10000 + Date.Month(EndDate)100 + Date.Day(EndDate), Age = Number.IntegerDivide((EndDateInt - BirthDateInt),10000) in Age
Are you new to Azure Data Factory and wondering what you don't know you don't know? The learning curve with new technologies can sometimes lead to some major refactoring down the line once we realize our mistakes. Join Meagan Longoria and Kerry Tyler to learn how to set up your data factory for success. They will start by discussing naming conventions, parameterization, Key Vault usage, and deployment with Azure DevOps. Then they'll share their recommendations on pipeline hierarchies, activity dependencies, error handling, and monitoring. Watch this webinar to help your organization avoid Data Factory regrets!
Watch Denny and Joey from DCAC, and Rob Krug from Avast as they talk about enterprise security, where companies fail from a security perspective, and what small / medium companies can do to get enterprise-grade security features without breaking the bank.
As Microsoft MVP’s and Partners as well as VMware experts, we are summoned by companies all over the world to fine-tune and problem-solve the most difficult architecture, infrastructure and network challenges.
And sometimes we’re asked to share what we did, at events like Microsoft’s PASS Summit 2015.
Necessary cookies are absolutely essential for the website to function properly. This category only includes cookies that ensures basic functionalities and security features of the website. These cookies do not store any personal information.
Any cookies that may not be particularly necessary for the website to function and is used specifically to collect user personal data via analytics, ads, other embedded contents are termed as non-necessary cookies. It is mandatory to procure user consent prior to running these cookies on your website.