Data Saturdays Logo Contest

Published On: 2021-03-23By:

Lots of exciting things are happening with Data Saturdays and we want you, the #SQLFamily, involved.

Last week, Rob Sewell (B|T) announced the new automation and setup process for events.  This week I am announcing a new logo and branding initiative. With the generous donation from Denny Cherry and Associates Consulting, DCAC, we have commissioned 99Designs artists to create a new logo for branding. The Data Saturdays Admins reviewed over 220 submissions these were considered the likeability, accessibility, inclusiveness, and ease of multi-use. We have narrowed down the choices to these four.  Please take a moment to click the link below and rate your favorite, we will use your influence to make the final decision.

We’re excited to show off the new branding and deliver to you a full marketing layout to help you with branding your events using the collaborative community owned Data Saturdays resources.

This contest will only stay open for 3 full days ! Please get your votes in soon!!!

CONTEST LINKhttps://99designs.com/contests/poll/6f3ab0edcb

Please be sure to click into each option, via the link, and zoom in for detail to see the full layout of design idea.

Feel free to reach out to any of these Data Saturday Admins with questions.

Contest Closing Friday, March 26th 9pm EST

The post Data Saturdays Logo Contest appeared first on A Shot of SQLEspresso.

Contact the Author | Contact DCAC

Azure Site to Site VPN to a Cisco ASA v8.6+

Published On: 2021-03-22By:

Recently I was moving a client from their existing CoLo to Azure. Right now they are going with IaaS for their web servers and PaaS (Azure SQL Managed Instance) for their databases. This means that we needed to set up a Site-to-Site VPN between their existing on-prem environment at the CoLo and their Azure vNet so that they could transfer everything over, replicate AD, etc. Their on-prem router was a Cisco ASA running version 8.6 of the Cisco IOS.

Our plan was to enable BGP on the connection as we had several vNet’s that we wanted to have connected to the VPN, and we only wanted one VPN gateway. However, with a Cisco ASA, we found that BGP wasn’t an option as Azure has a new feature available within the configuration of the connect called “Policy Based Traffic Selector” which required that BGP be disabled.

Site-To-Site VPN Connection Properties in the Azure Portal

The problem that we were facing with our VPN connection was that even though all of the IKE Phase 1 and IKE Phase 2 settings were connect (we verified them against the Cisco ASA settings probably 10 times) we were still getting protocol mismatch when the VPN connection would attempt to come online.

We eventually tried turning on the “Use policy based traffic selector” option on the connection (which made us disable BGP) and the connection came up on the first try. For this, we were using a policy-based VPN Endpoint, and these settings are pretty new (it’s March 2021) as I don’t remember seeing them on previous VPN connections.

As you can see from the screenshot above, we specified the IKE Phase 1 and IKE Phase 2 settings, but this doesn’t need to be done as using any of the settings that Microsoft has will work, but we wanted to take the guesswork out of the migration process.

The end result was one of success, the tunnel came up and we could set up things like AD Replication so that we could continue with the migration process.

Denny

Contact the Author | Contact DCAC

Initial Thoughts on Dremio

Published On: 2021-03-18By:

I’ve been working on a project for the last few months with a client who has chosen to implement Dremio in Azure. Dremio is a data lake engine that creates a semantic layer and supports interactive queries.

Dremio logo
The Dremio logo

It uses Apache Arrow, Gandiva, and Parquet files under the hood. It runs on either Linux VMs or Kubernetes containers. Like most big data systems, there is at least one coordinator node and one or more executor nodes. These nodes communicate and are managed using Apache Zookeeper. Client applications connect to Dremio via ODBC, JDBC, REST APIs, or Arrow Flight. Dremio can read from storage accounts, external databases, and a few other sources.

Dremio stores data in the following places:

  • Metadata is stored in a RocksDB database on the coordinator node.
  • Frequently read data is cached on the executor node.
  • Memory-intensive query operations may cause an executor node to spill Arrow buffers from RAM to disk.
  • Reflections, user uploads, and query results are stored in the data lake.

Dremio is organized into spaces, which can contain folders and datasets. The key objects in Dremio are:

  • Data source – connection strings to data that should be accessed via Dremio
  • Physical Dataset – an HDFS directory or a database table
  • Virtual Dataset – a view of sorts, created using the Dremio UI or by writing SQL, that references one or more physical or virtual datasets and also provides lineage to its sources
  • Reflection – a materialized view that is transparent to users and is used to improve query performance, which seems to be implemented as Dremio querying data from the source and storing it as a parquet file for quicker access.
  • Space – a shared location for virtual datasets, a way to group related datasets and provide user access

Once you have your spaces and virtual datasets set up, it feels kind of like a database. If you connect with Power BI, virtual datasets appear as views and physical datasets appear as tables. Dremio metadata (catalogs, schemas, physical datasets, virtual datasets and columns) can be accessed using INFORMATION_SCHEMA queries, which is conveniently familiar if you are used to working with SQL Server.

Some nice features found in Dremio on Azure

  • Dremio allows Single Sign-On with AAD credentials. Permissions can be granted to individual users or AAD groups.
  • Dremio can be implemented in a virtual network in Azure. The executor nodes can use Private Link to access ADLS (Azure Data Lake Storage Gen 2) over a private endpoint.
  • Changes to virtual datasets are tracked in Dremio. It’s easy to revert to a previous version at any time.
  • Dremio gives you visibility to the jobs running queries, both for ad hoc queries from client tools and for refreshing reflections.
  • Administrators can create rules to assign queries to different queues in order to provide workload isolation and predictability for users.
  • When reviewing jobs, you can see a sort of query plan as well as which jobs were able to use a reflection to accelerate a query.
  • The lineage view for a virtual dataset is nice for understanding dependencies.
  • You can trigger refreshes of metadata or reflections via the Rest API, which is handy if you have ETL processes adding new data to your data lake, and you want to refreshes to occur at the end of the ETL process.

Some rough edges on Dremio in Azure

  • Dremio was initially built for AWS, not Azure. This is evident in the training materials, the product roadmap, and the knowledge of the Dremio implementation specialists. This is not to say it doesn’t work on Azure, just that the implementation is a bit rougher (e.g., no Azure templates made for you), and a couple of features are unavailable.
  • Dremio doesn’t integrate with Azure Key Vault. You store the service principal secret or storage account access key in a configuration file on the Linux VM. I’ve been told this is on the roadmap, but I didn’t hear a date when it would be available.
  • You can enable integration points on the Dremio website where you can click a button to open a connection to a virtual dataset in a BI tool such as Power BI or Tableau. For Power BI, this downloads a PBIDS file with a connection to that specific virtual dataset. This would be fine if everything you need is in this one dataset, but if you need to reference multiple virtual datasets, this is a bit annoying. Think of it like connecting to a specific database table instead of to the database in general. You might want to use that table, but you might also want to find other useful tables to combine in your Power BI model. You can open Power BI and connect to Dremio in general and navigate from there with no problems. I’m just pointing out that the buttons in the UI don’t seem that useful.
  • Dremio doesn’t support passthrough authentication on ADLS. All queries to the data lake are made in the context of the Dremio application, not the individual user. This means that you may need to set permissions twice for your data lake if you have other tools directly accessing the data lake instead of using Dremio. The idea is that most tools will connect through Dremio to take advantage of the semantic layer. But it would be nice to have, just to simplify security.

Advice we received in training

  • Unlike with nesting views in SQL Server, it’s ok to create multiple layers of virtual datasets. You want to design the semantic layer (the virtual datasets) to reuse common logic instead of repeating it across multiple views.
  • The standard design pattern for the semantic layer is to have a layer of “staging views” that have a 1-to-1 mapping to physical datasets and very little transformation outside of fixing data types and light cleansing. On top of the Staging layer is the Business layer, which includes virtual datasets containing business logic. The Business layer should handle most of the query workload. On top of the Business layer is the Application Layer. This includes virtual datasets that are purpose-built to support specific applications or reports.
  • Star schemas are not optimal in Dremio. You likely want to denormalize even more than that. This is because it is more expensive to perform a join than to search through a large number of values in a column.
  • When creating a reflection, setting the sort column is somewhat like creating an index in a SQL database. It helps prune data when applying a query filter or performing a join.
  • Reflections can be used to partition data. If you find you have a single large file, you can use a Reflection to split it by a low cardinality value to improve query performance. When you do this, it creates a parquet file per partition.
  • Reflections can be set to use an incremental refresh, but only if the data is additive and existing data is not updated.
  • You don’t need a reflection for everything. Make them as small and reusable a possible.
  • Try to avoid thousands of tiny files, and aim for a few medium to large files (MBs to GBs). This is common for most data lake engines as there is an overhead cost for file enumeration.

Some other thoughts

  • Dremio advertises that you don’t need data integration processes like you would for a data warehouse. I find this to be somewhat inaccurate for two main reasons. First, if you need to acquire data from APIs or other applications to which Dremio can’t connect, you will still need to copy data to your data lake. Second, when you use a Reflection to speed up a query, you are creating a copy of the data in your data lake stored as one or more Parquet files. Data virtualization technology hasn’t actually matured to the point of not needing ETL at all. I can see how Dremio would lessen the need for ETL, but let’s recognize that you’ll probably still need some and that Dremio is doing a bit of data loading of it’s own. So the question becomes where — and with what tools — you would like to do this. You can have Dremio do your transforming and loading in the form of reflections, or you can load your own data already transformed to the data lake. You will likely end up with a bit of both over time.
  • Consider the skillsets of the people who will manage the system, as well as those who will build and query the datasets. If you have a team of admins who only know Windows, they are going to need to skill up on Linux. If your BI team or analysts don’t know SQL, they will probably struggle to build the virtual datasets.
  • This system can get pretty expensive pretty fast (which is true of most big data systems). You’ll want to be sure to automate the shutdown of the nodes in dev and test environments when they are not in use, so you can save a bit of money. And remember that you can size up your nodes later if you find you don’t have adequate performance. Oversizing at the outset will waste money.
  • Dremio is a (well-funded) startup with a product that is built on several open source technologies, and they don’t seem to have a public roadmap. In my experience, they have been good about taking feedback to add to the roadmap and with sharing what is soon to be released. But if you are building your company’s BI strategy with Dremio as a key tool, you probably want more than that. It sounds like they share more with paying customers. I would want that information before making a purchasing decision.
  • Overall, I can see why Dremio has been adopted by several large companies. And I have enjoyed setting up the Azure architecture around it and building virtual datasets. I wish they would add some Azure-specific features to optimize things and make security easy, but it’s a promising platform.

More Information about Dremio

If Dremio sounds interesting to you, here are a few helpful links

This was my first project using Dremio. If you’ve used Dremio, please share your experience in the comments.

Contact the Author | Contact DCAC

Initial Thoughts on Dremio

Published On: By:

I’ve been working on a project for the last few months with a client who has chosen to implement Dremio in Azure. Dremio is a data lake engine that creates a semantic layer and supports interactive queries.

Dremio logo
The Dremio logo

It uses Apache Arrow, Gandiva, and Parquet files under the hood. It runs on either Linux VMs or Kubernetes containers. Like most big data systems, there is at least one coordinator node and one or more executor nodes. These nodes communicate and are managed using Apache Zookeeper. Client applications connect to Dremio via ODBC, JDBC, REST APIs, or Arrow Flight. Dremio can read from storage accounts, external databases, and a few other sources.

Dremio stores data in the following places:

  • Metadata is stored in a RocksDB database on the coordinator node.
  • Frequently read data is cached on the executor node.
  • Memory-intensive query operations may cause an executor node to spill Arrow buffers from RAM to disk.
  • Reflections, user uploads, and query results are stored in the data lake.

Dremio is organized into spaces, which can contain folders and datasets. The key objects in Dremio are:

  • Data source – connection strings to data that should be accessed via Dremio
  • Physical Dataset – an HDFS directory or a database table
  • Virtual Dataset – a view of sorts, created using the Dremio UI or by writing SQL, that references one or more physical or virtual datasets and also provides lineage to its sources
  • Reflection – a materialized view that is transparent to users and is used to improve query performance, which seems to be implemented as Dremio querying data from the source and storing it as a parquet file for quicker access.
  • Space – a shared location for virtual datasets, a way to group related datasets and provide user access

Once you have your spaces and virtual datasets set up, it feels kind of like a database. If you connect with Power BI, virtual datasets appear as views and physical datasets appear as tables. Dremio metadata (catalogs, schemas, physical datasets, virtual datasets and columns) can be accessed using INFORMATION_SCHEMA queries, which is conveniently familiar if you are used to working with SQL Server.

Some nice features found in Dremio on Azure

  • Dremio allows Single Sign-On with AAD credentials. Permissions can be granted to individual users or AAD groups.
  • Dremio can be implemented in a virtual network in Azure. The executor nodes can use Private Link to access ADLS (Azure Data Lake Storage Gen 2) over a private endpoint.
  • Changes to virtual datasets are tracked in Dremio. It’s easy to revert to a previous version at any time.
  • Dremio gives you visibility to the jobs running queries, both for ad hoc queries from client tools and for refreshing reflections.
  • Administrators can create rules to assign queries to different queues in order to provide workload isolation and predictability for users.
  • When reviewing jobs, you can see a sort of query plan as well as which jobs were able to use a reflection to accelerate a query.
  • The lineage view for a virtual dataset is nice for understanding dependencies.
  • You can trigger refreshes of metadata or reflections via the Rest API, which is handy if you have ETL processes adding new data to your data lake, and you want to refreshes to occur at the end of the ETL process.

Some rough edges on Dremio in Azure

  • Dremio was initially built for AWS, not Azure. This is evident in the training materials, the product roadmap, and the knowledge of the Dremio implementation specialists. This is not to say it doesn’t work on Azure, just that the implementation is a bit rougher (e.g., no Azure templates made for you), and a couple of features are unavailable.
  • Dremio doesn’t integrate with Azure Key Vault. You store the service principal secret or storage account access key in a configuration file on the Linux VM. I’ve been told this is on the roadmap, but I didn’t hear a date when it would be available.
  • You can enable integration points on the Dremio website where you can click a button to open a connection to a virtual dataset in a BI tool such as Power BI or Tableau. For Power BI, this downloads a PBIDS file with a connection to that specific virtual dataset. This would be fine if everything you need is in this one dataset, but if you need to reference multiple virtual datasets, this is a bit annoying. Think of it like connecting to a specific database table instead of to the database in general. You might want to use that table, but you might also want to find other useful tables to combine in your Power BI model. You can open Power BI and connect to Dremio in general and navigate from there with no problems. I’m just pointing out that the buttons in the UI don’t seem that useful.
  • Dremio doesn’t support passthrough authentication on ADLS. All queries to the data lake are made in the context of the Dremio application, not the individual user. This means that you may need to set permissions twice for your data lake if you have other tools directly accessing the data lake instead of using Dremio. The idea is that most tools will connect through Dremio to take advantage of the semantic layer. But it would be nice to have, just to simplify security.

Advice we received in training

  • Unlike with nesting views in SQL Server, it’s ok to create multiple layers of virtual datasets. You want to design the semantic layer (the virtual datasets) to reuse common logic instead of repeating it across multiple views.
  • The standard design pattern for the semantic layer is to have a layer of “staging views” that have a 1-to-1 mapping to physical datasets and very little transformation outside of fixing data types and light cleansing. On top of the Staging layer is the Business layer, which includes virtual datasets containing business logic. The Business layer should handle most of the query workload. On top of the Business layer is the Application Layer. This includes virtual datasets that are purpose-built to support specific applications or reports.
  • Star schemas are not optimal in Dremio. You likely want to denormalize even more than that. This is because it is more expensive to perform a join than to search through a large number of values in a column.
  • When creating a reflection, setting the sort column is somewhat like creating an index in a SQL database. It helps prune data when applying a query filter or performing a join.
  • Reflections can be used to partition data. If you find you have a single large file, you can use a Reflection to split it by a low cardinality value to improve query performance. When you do this, it creates a parquet file per partition.
  • Reflections can be set to use an incremental refresh, but only if the data is additive and existing data is not updated.
  • You don’t need a reflection for everything. Make them as small and reusable a possible.
  • Try to avoid thousands of tiny files, and aim for a few medium to large files (MBs to GBs). This is common for most data lake engines as there is an overhead cost for file enumeration.

Some other thoughts

  • Dremio advertises that you don’t need data integration processes like you would for a data warehouse. I find this to be somewhat inaccurate for two main reasons. First, if you need to acquire data from APIs or other applications to which Dremio can’t connect, you will still need to copy data to your data lake. Second, when you use a Reflection to speed up a query, you are creating a copy of the data in your data lake stored as one or more Parquet files. Data virtualization technology hasn’t actually matured to the point of not needing ETL at all. I can see how Dremio would lessen the need for ETL, but let’s recognize that you’ll probably still need some and that Dremio is doing a bit of data loading of it’s own. So the question becomes where — and with what tools — you would like to do this. You can have Dremio do your transforming and loading in the form of reflections, or you can load your own data already transformed to the data lake. You will likely end up with a bit of both over time.
  • Consider the skillsets of the people who will manage the system, as well as those who will build and query the datasets. If you have a team of admins who only know Windows, they are going to need to skill up on Linux. If your BI team or analysts don’t know SQL, they will probably struggle to build the virtual datasets.
  • This system can get pretty expensive pretty fast (which is true of most big data systems). You’ll want to be sure to automate the shutdown of the nodes in dev and test environments when they are not in use, so you can save a bit of money. And remember that you can size up your nodes later if you find you don’t have adequate performance. Oversizing at the outset will waste money.
  • Dremio is a (well-funded) startup with a product that is built on several open source technologies, and they don’t seem to have a public roadmap. In my experience, they have been good about taking feedback to add to the roadmap and with sharing what is soon to be released. But if you are building your company’s BI strategy with Dremio as a key tool, you probably want more than that. It sounds like they share more with paying customers. I would want that information before making a purchasing decision.
  • Overall, I can see why Dremio has been adopted by several large companies. And I have enjoyed setting up the Azure architecture around it and building virtual datasets. I wish they would add some Azure-specific features to optimize things and make security easy, but it’s a promising platform.

More Information about Dremio

If Dremio sounds interesting to you, here are a few helpful links

This was my first project using Dremio. If you’ve used Dremio, please share your experience in the comments.

Contact the Author | Contact DCAC
1 2 3 4 487

Video

Globally Recognized Expertise

As Microsoft MVP’s and Partners as well as VMware experts, we are summoned by companies all over the world to fine-tune and problem-solve the most difficult architecture, infrastructure and network challenges.

And sometimes we’re asked to share what we did, at events like Microsoft’s PASS Summit 2015.

Awards & Certifications

Microsoft Partner   Denny Cherry & Associates Consulting LLC BBB Business Review    Microsoft MVP    Microsoft Certified Master VMWare vExpert
INC 5000 Award for 2020    American Business Awards People's Choice    American Business Awards Gold Award    American Business Awards Silver Award    FT Americas’ Fastest Growing Companies 2020   
Best Full-Service Cloud Technology Consulting Company       Insights Sccess Award    Technology Headlines Award    Golden Bridge Gold Award    CIO Review Top 20 Azure Solutions Providers
Share via
Copy link