It’s not always obvious when you need a data gateway in Azure, and not all gateways are labeled as such. So I thought I would walk through various applications that act as a data gateway and discuss when, where, and how many are needed.
Note: I’m ignoring VPN gateways and application gateways for the rest of this post. You could set up a VPN gateway to get rid of some of the data gateways, but I’m assuming your networking/VPN situation is fixed at this point and working from there. This is a common case in my experience as many organizations are not ready to jump into Express Route when first planning their BI architecture.
Let’s start with what services may require you to use a data gateway.
You will need a data gateway when you are using Power BI, Azure Analysis Services, PowerApps, Microsoft Flow, Azure Logic Apps, Azure Data Factory, or Azure ML with a data source/destination that is in a private network that isn’t connected to your Azure subscription with a VPN gateway. Note that a private network includes on-premises data sources and Azure Virtual Machines as well as Azure SQL Databases and Azure SQL Data Warehouses that require use of VNet service endpoints rather than public endpoints.
Luckily, many of these services can use the same data gateway. Power BI, Azure Analysis Services, PowerApps, Microsoft Flow, and Logic Apps all use the On Premises Data Gateway. Azure Data Factory (V1 and V2) and Azure Machine Learning Studio use the Data Factory Self-Hosted Integration Runtime.
On Premises Data Gateway (Power BI et al.)
If you are using one or more of the following:
- Power BI
- Azure Analysis Services
- Microsoft Flow
- Logic Apps
and you have a data source in a private network, you need at least one gateway. But there are a few considerations that might cause you to set up more gateways.
- Your services must be in the same region to use the same gateway. This means that your Power BI/Office 365 region and Azure region for your Azure Analysis Services resource must match for them to all use one gateway. If you have resources in different regions, you will need one gateway per region.
- You may want high availability for your gateway. You can create high availability clusters so when one gateway is down, traffic is rerouted to another available gateway in the cluster.
- You may want to segment traffic to ensure the necessary resources for certain ad hoc live/direct queries or scheduled refreshes. If your usage and refresh patterns warrant it, you may want to set up one gateway for scheduled refreshes and one gateway for live/direct queries back to any on-premises data sources. Or you might make sure live/direct queries for two different high-traffic models go through different gateways so as not to block each other. This isn’t always warranted, but it can be a good strategy.
Data Factory Self-hosted Integration Runtime
If you are using Azure Data Factory (V1 or V2) or Azure ML with a data source in a private network, you will need at least one gateway. But that gateway is called a Self-hosted Integration Runtime (IR).
Self-hosted IRs can be shared across data factories in the same Azure Active Directory tenant. They can be associated with up to four machines to scale out or provide higher availability. So while you may only need one node, you might want a second so that your IR is not the single point of failure.
Or you may want multiple IRs to boost throughput of copy activities. For instance, copying from an on-premises file server with one IR node is about 195 Megabytes per second (MB/s). But with 4 IR nodes, it can be as fast as 505 MB/s.
Factors that Affect the Number of Data Gateways Needed
The main factors determining the number of gateways you need are:
- Number of data sources in private networks (including Azure VNets)
- Location of services in Azure and O365 (number of regions and tenants)
- Desire for high availability
- Desire for increased throughput or segmented traffic
If you are importing your data to Azure and using an Azure SQL DB with no VNet as the source for your Power BI model, you won’t need an On Premises Data Gateway. If you used Data Factory to copy your data from an on-premises SQL Server to Azure Data Lake and then Azure SQL DB, you need a Self-Hosted Integration Runtime.
If all your source data is already in Azure, and your source for Power BI or Azure Analysis Services is Azure SQL DW on a VNet, you will need at least one On-Premises Data Gateway.
If you import a lot of data to Azure every day using Data Factory, and you land that data to Azure SQL DW on a VNet, then use Azure Analysis Services as the data source for Power BI reports, you might want a self-hosted integration runtime with a few nodes and a couple of on-premises gateways clustered for high availability.
Have a Plan For Your Gateways
The gateways/integration runtimes are not hard to install. They are just often not considered, and projects get stalled waiting until a machine is provisioned to install them on. And many people forget to plan for high availability in their gateways. Make sure you have the right number of gateways and IR nodes to get your desired features and connectivity. You can add gateways/nodes later, but you don’t want to get caught with no high availability when it really matters.