Some of the most eagerly anticipated features now available in SQL Server 2019 relate to expanded data virtualization capabilities. Data virtualization enables you to bring all your data together when you query it instead of using complex ETL procedures to physically move the data into one place. In other words, the data stays in its original location, but you can query it just as if it was local.
Data virtualization in SQL Server uses built-in technology known as PolyBase. PolyBase has been available in SQL Server since 2016, but the types of data sources it can query have been significantly expanded in SQL Server 2019. These include Hadoop, Azure Blob Storage, external SQL Server instances, Oracle, Teradata, MongoDB, and any database using ODBC standards. Of course, if your data is in a CSV file, you can query that, too.
Given the growing practice of choosing the database type that best fits the purpose at hand rather than defaulting to one standard or another across the enterprise, PolyBase is likely to become an essential part of the DBA’s toolkit. It allows those different data sources to be consumed and analyzed in an efficient manner.
One of the latest buzzword is “Big Data Clusters”. Thankfully, PolyBase is a natural complement to SQL Server Big Data Clusters. These scalable clusters of SQL Server, Spark, and HDFS Containers run on Kubernetes. This empowers you to analyze big data using Transact-SQL or Spark—putting relational data and big data on equal footing to maximize the value of your entire data estate. With PolyBase, you can use Big Data Clusters to query external data sources without moving or copying the data. This provides an interesting alternative to data lake or data mart approaches that require moving or ingesting data into a centralized system.
If you have SQL Server 2019 Enterprise Edition, you already have access to PolyBase. It’s relatively simple to enable it on a SQL Server instance. When setting up the metadata to access the external data sources, a master key is used to secure access. Once this is accomplished, you can create a database scoped credential to authenticate against those sources. The next step is to specify the external data source, including its location. If you want to keep the external table that defines data source objects separate from internal tables, you can create an external schema.
Once it’s all set up, querying is completely transparent—you can run queries just as if the data was stored in the local database. In fact, PolyBase can push the compute to the external source, often resulting in significant performance improvements. Just remember to enable this “pushdown” feature when you’re setting up your virtualized data sources and keep an eye on the performance of your target database.
With fluid, high performance data virtualization built into SQL server 2019, it’s now possible to create innovative solutions that use heterogeneous data in a fraction of the time it would take to design and deploy a traditional data warehouse. In general, virtualization requires less hardware, fewer software licenses, less coding, and just generally less hassle. While it certainly won’t eliminate the need to move data, it opens many new exciting possibilities for getting value from what you have, where you have it.
Data virtualization is a game changer for data professionals as it allows data consumption from a centralized location. This will help to make the process easier to develop, enhance and maintain. Microsoft continues to enhance the realm of data with these new features and I’m excited to see what they bring to the table next. To learn more about what you can do with Microsoft SQL 19, check out the free Packt guide Introducing Microsoft SQL 19.
© 2019, John Morehouse. All rights reserved.Contact the Author | Contact DCAC