
Working with Large Tables in SQL Server

Warning: This blog contains old tricks of an old dog.

Scenario: Suppose you have a large table in SQL Server, e.g. hundreds of millions or even a billion rows. DML operations (SELECT, INSERT, UPDATE, DELETE) take a long time. How do you speed them up? Do you split the large table into multiple tables? Or, do you ask for better hardware? Or, do you start looking for a new job with less data?

Solution: It’s nothing new but I see clients struggle with this all the time because they don’t know any better.

The solution is to partition the table and use partition switching that SQL Server has supported since time immemorial.

Cathrine Wilhelmsen has a great step-by-step blog covering different scenarios, but the process goes like this:

  1. Configure page compression for the large table (see benefits here).
  2. Partition the large table, such as by month.
  3. Create a non-partitioned staging table that has the same indexes and compression as the large table.
  4. Find the corresponding partition in the large table that will require DML, such as by using this script.
  5. If the data requires updates, switch out the affected partition to the staging table and perform the updates there. For full loads where rows will only be inserted, you don’t have to switch out the partition (see the second scenario in Cathrine’s blog).
  6. Switch the staging table back into the corresponding partition of the large table. This should take only a few seconds (see the sketch after this list).
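
Here is a minimal T-SQL sketch of steps 1, 4, 5, and 6. It assumes a hypothetical dbo.FactSales table partitioned by month on InvoiceDate via a partition function pfMonthly, with dbo.FactSales_Staging mirroring its schema, indexes, and compression (all names are made up for illustration):

-- Step 1: apply page compression to all partitions of the large table
ALTER TABLE dbo.FactSales REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = PAGE);

-- Step 4: find the partition number that holds the month being loaded
SELECT $PARTITION.pfMonthly('2021-10-15') AS PartitionNumber;

-- Step 5: switch out the affected partition (say, partition 42) to the staging table
ALTER TABLE dbo.FactSales SWITCH PARTITION 42 TO dbo.FactSales_Staging;

-- ...perform the updates against the much smaller staging table...

-- Step 6: switch the staging table back into the same (now empty) partition.
-- The staging table needs a CHECK constraint on InvoiceDate matching the target
-- partition's boundaries for the switch-in to succeed.
ALTER TABLE dbo.FactSales_Staging SWITCH TO dbo.FactSales PARTITION 42;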

As a bonus, the SQL Server query processor can eliminate partitions for SELECT queries, further improving performance.
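
For example, with the hypothetical monthly partitioning above, a filter on the partitioning column lets the optimizer read only the matching partition:

SELECT SUM(Amount)
FROM dbo.FactSales
WHERE InvoiceDate >= '2021-10-01' AND InvoiceDate < '2021-11-01';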

Prologika Newsletter Winter 2021

Happy Holidays! More and more organizations consider data virtualization to abstract the underlying storage and integrate siloed sources. In this letter, I’ll discuss a real-life project that used PolyBase to expose third-party ERP data as SQL tables. Before I get to the subject of this newsletter, I’m excited to announce the seventh edition of my “Applied Microsoft Power BI” book. It should be available on Amazon in the first days of 2022. As far as I know, it’s the only book that is updated annually to keep it up to date with the fast-changing Power BI. Stay tuned for a future blog with more details about the book.

 

Business Case

Think of data virtualization as a logical data layer that integrates enterprise data across various on-premises and cloud sources. A large, multinational chemical manufacturer decided to migrate their on-premises ERP system to the cloud. As usually happens, the tradeoff for embracing the cloud is losing access to your data in its native storage. Previously, the client could readily integrate the ERP data stored in a SQL Server database, but the ERP vendor didn’t support this option in their cloud offering. The usual explanation cites security and performance concerns, although neither really holds water. The ERP vendor could have supported a premium tier where data is exposed privately without affecting other customers, and report queries could have been redirected to a secondary replica. This is no different than securing and scaling an Azure SQL Database. Alas, as more and more companies find when embracing the cloud, the integration burden gets heavier and falls on them, not on the vendor.

To make things even more difficult, the vendor’s replication mechanism exported the data to an AWS S3 data lake. Realizing that most clients would struggle calling their REST APIs, the vendor provided a JDBC driver that abstracted the APIs. Great, except that the client wanted to access the data on Microsoft Azure, and no Microsoft tool supports JDBC drivers because no Microsoft BI tool is written in Java.

Integration Options

One integration option could have been to use a JDBC-capable ETL tool, such as Pentaho. But that would have required implementing integration pipelines to pull the data periodically and stage it on Azure. This presented two issues. First, the data integration effort became more difficult because someone had to own and troubleshoot ETL failures. Second, the business wanted as close to real-time access to the data as possible. In the past, the business users had implemented self-service Power BI models that they would refresh as needed to cache the data from the on-premises database. However, since the ERP vendor required at least 20 minutes for data changes to be applied to the S3 data lake and ETL needed additional time (even with incremental extraction), the data latency became an issue.

The second option was to somehow virtualize the data. Had the vendor supported exporting the data as files on Azure, Synapse Serverless could have been used to expose the data as virtual tables that can be queried with SQL and loaded in Power BI Desktop. But Serverless doesn’t support AWS S3 and even if it did, the vendor didn’t allow direct access to the staged data (REST APIs and JDBC driver were the only supported options).

PolyBase to the Rescue

The solution I proposed was to use PolyBase, which is included in SQL Server and Azure SQL Managed Instance. Ideally, the client wanted a full PaaS solution, but only PolyBase in SQL Server supports ODBC data sources. So, we had to use an IaaS VM just to virtualize the data. This diagram shows the solution architecture.

An Azure VM was provisioned with SQL Server 2019 Standard Edition. The JDBC driver was installed on the VM and configured to access the ERP data. We used a third-party JDBC-to-ODBC bridge driver to map the JDBC data source as an ODBC data source. Then, PolyBase external tables were set up to virtualize the ERP data as SQL Server tables.
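
To give an idea of what such a setup involves, here is a minimal sketch of virtualizing an ODBC source with SQL Server 2019 PolyBase. The DSN, host, credential, and table definition below are hypothetical, not the client’s actual objects:

-- Requires a database master key (CREATE MASTER KEY), omitted for brevity
CREATE DATABASE SCOPED CREDENTIAL ErpCredential
WITH IDENTITY = 'erp_user', SECRET = '<secret>';

-- External data source pointing to the ODBC bridge DSN; PUSHDOWN lets PolyBase
-- delegate filters to the source where possible
CREATE EXTERNAL DATA SOURCE ErpLake
WITH (
    LOCATION = 'odbc://erp-bridge-host:443',
    CONNECTION_OPTIONS = 'DSN=ErpOdbcBridge',
    PUSHDOWN = ON,
    CREDENTIAL = ErpCredential
);

-- External table mapping a remote entity to a SQL Server table
CREATE EXTERNAL TABLE dbo.SalesOrders
(
    OrderId INT,
    OrderDate DATETIME2,
    Amount DECIMAL(18, 2)
)
WITH (
    LOCATION = 'SalesOrders',
    DATA_SOURCE = ErpLake
);

-- The external table can now be queried like any other SQL Server table
SELECT TOP (10) * FROM dbo.SalesOrders;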

The main drawback of this solution was that no matter how small the source table was, PolyBase added about 30 seconds of internal processing. Specifically, the PolyBase runtime log shows a detailed trail: PolyBase takes some time to “warm up” before it gets to the query, and then it needs even more time to process the results. That’s because PolyBase is a distributed system (like Synapse), where a head node coordinates the query execution with data nodes even if everything is installed on a single VM. More gotchas specific to the fact that the vendor decided to use Oracle as their relational database can be found at https://prologika.com/polybase-adventures/.

Conclusion

Microsoft has made bold strides in data virtualization. I’m really impressed by Synapse Serverless, which I used for other projects, such as the project described in this case study. I wish Microsoft would extend Serverless to support more storage options. If Synapse Serverless is not an option, your next best bet is PolyBase. Although PolyBase is supported in SQL MI and Synapse, only the SQL Server box SKU supports ODBC data sources, requiring an IaaS layer to virtualize the data.

Benefits

The solution delivered the following benefits to the client:

  • No ETL effort – Data was left at the original source.
  • Data virtualization – PolyBase was used to create external tables that can be queried just like regular SQL Server tables.
  • Reduced data latency – Data changes were available as soon as they were replicated to the data lake.
  • Scalability – PolyBase can be scaled out to other servers if needed.


Teo Lachev
Prologika, LLC | Making Sense of Data
Microsoft Partner | Gold Data Analytics


PolyBase Adventures

I’m setting up SQL Server 2019 PolyBase for ODBC access to a vendor data lake (via a JDBC bridge) to virtualize entities as SQL tables. Overall, a smooth experience with a few gotchas:

Data type mappings

The vendor lake uses the Oracle data types TIMESTAMP WITH TIME ZONE and BOOLEAN, which Java doesn’t know how to map. The solution was to set up a view in the data lake (luckily, the vendor supports that) to cast these data types to NVARCHAR and INTEGER.
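
For illustration, the view might look something like the sketch below, written with made-up names; the exact cast syntax depends on the lake’s SQL dialect:

CREATE OR REPLACE VIEW erp_orders_v AS
SELECT
    order_id,
    CAST(order_ts AS NVARCHAR(40)) AS order_ts_text,         -- timestamp with time zone -> string
    CASE WHEN is_active THEN 1 ELSE 0 END AS is_active_int   -- boolean -> integer
FROM erp_orders;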

NullPointerException

Once the table is finally set up what do we get when querying it?

105082;Generic ODBC error: java.lang.NullPointerException .

How do we fix this horrible issue? Upgrade SQL Server and PolyBase to the latest cumulative update (CU).

The final mystery that I haven’t been able to crack yet is that, for some obscure reason, PolyBase adds quite a bit of performance overhead to the query execution. So, if a query in DBeaver connected directly to the lake (or Power BI Desktop connected directly to the ODBC driver) takes eight seconds, PolyBase stretches it to a minute. Examining the DMVs shows that the actual query does execute in line with DBeaver, but there is some additional overhead from PolyBase that would require a support case with Microsoft.
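
If you want to see where the time goes, the PolyBase DMVs break a distributed request into steps. This is only a sketch (the execution_id value is hypothetical):

-- Recent PolyBase (distributed) requests
SELECT TOP (10) execution_id, status, start_time, end_time, total_elapsed_time
FROM sys.dm_exec_distributed_requests
ORDER BY end_time DESC;

-- Steps of a specific request, showing elapsed time per step
SELECT step_index, operation_type, location_type, status, total_elapsed_time, command
FROM sys.dm_exec_distributed_request_steps
WHERE execution_id = 'QID1234'
ORDER BY step_index;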

Prologika Newsletter Fall 2021

LabCorp operates one of the largest clinical laboratory networks in the world. It also has an Interactive Response Technology system that healthcare vendors can use to conduct case studies.  Thanks to the cloud data analytics solution implemented by Prologika, LabCorp and its vendors can now analyze data across case studies. Read this newsletter to learn more about the solution architecture and business value.

 

Business Needs

The data for each study was saved into a separate on-prem SQL database. The total number of databases was more than 1,000. After the initial assessment, Prologika realized that one of the main gaps was that vendors couldn’t report across their studies or gain performance insights from studies conducted by other vendors. Further, because the IRT system had evolved over time and accommodated special requests, there were schema differences between versions.

LabCorp underscored the importance of consolidating the data from multiple studies into a single repository. They envisioned a cloud-based PaaS BI solution that would extract data from all the on-prem databases without impacting system performance and centralize it into an enterprise data warehouse. Vendors would log in to an external portal that would deliver embedded reports. The first iteration focused on analyzing audit and log data to gain strategic insights, such as how many users are using the system.

Solution

After assessing the current state and objectives, Prologika recommended and implemented the following architecture:

Prologika implemented an Azure Data Factory (ADF) configurable framework to extract data from the on-prem databases hosted on two production SQL Servers in parallel. The framework would stage the data into partitioned parquet files in Azure Data Lake Storage (ADLS).

Then Prologika created views in Synapse Serverless to consolidate the file extracts. For example, an Audit table could exist in several on-prem databases on both production servers. While each table would be staged in a separate file (or multiple files if the table supports incremental extraction), Synapse Serverless would present a consolidated view across all files. We were impressed by the capabilities and performance of Synapse Serverless.
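
To illustrate the approach, here is a minimal sketch of such a consolidating view (the storage path and columns are hypothetical):

CREATE VIEW dbo.AuditConsolidated
AS
SELECT r.AuditId, r.EventDate, r.UserName, r.filepath(1) AS SourceDatabase
FROM OPENROWSET(
    BULK 'https://mystorage.dfs.core.windows.net/irt/audit/*/*.parquet',
    FORMAT = 'PARQUET'
) WITH (
    AuditId BIGINT,
    EventDate DATETIME2,
    UserName VARCHAR(100)
) AS r;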

Another ADF process would extract the data from the Synapse Serverless views and load it into a data warehouse hosted in a Synapse SQL pool. Although we considered other methods, such as using a linked server in an Azure SQL Managed Instance to Synapse Serverless to avoid data staging, we settled on Synapse SQL mainly for its scalability. Finally, a Power BI semantic model was created, and Power BI Embedded was used to deliver reports.

Benefits

The solution delivered the following benefits to LabCorp:

  • Data consolidation – Data was extracted with minimum impact to the operational systems and consolidated in ADLS.
  • Data virtualization – Thanks to the Synapse Serverless flexible support of schema differences, virtual views were created.
  • Scalability – Synapse SQL pool can scale almost indefinitely.
  • Secure and fast insights – Each vendor can access only their data.


Teo Lachev
Prologika, LLC | Making Sense of Data
Microsoft Partner | Gold Data Analytics


Prologika Newsletter Summer 2021

Business Performance Management (BPM) is a methodology to help the company predict its performance. An integral part of a BPM strategy is creating and monitoring a scorecard with Key Performance Indicators (KPIs). In this newsletter, I’ll discuss how the newly released Power BI Goals can help you augment your BPM strategy. But before that, I’d like to share my excitement that Microsoft has recently awarded me FastTrack Recognized Solution Architect – Power BI! I’m one of the 33 individuals worldwide who were nominated by the Power Platform product engineering team for consistently exhibiting deep architecture expertise and creating high-quality solutions for customers during project engagements.

Introducing Power BI Goals

A vital BI practice for every organization, performance management ensures that important metrics, such as Key Performance Indicators (KPIs), meet established goals. The typical artifact to do so is a scorecard: a report that compares the current state with the desired state of these metrics. You might have also heard the term “balanced scorecard”, which is an organization-wide scorecard that tracks several subject areas, such as Finance, Customer, and Operations. In the past, organizations would use different tools, such as the now-deprecated PerformancePoint (included in SharePoint Server), to implement balanced scorecards. Realizing the importance of scorecards, Power BI introduced Goals, which aim to simplify the process of implementing departmental and organizational scorecards. For more information on how Goals work, watch the “Goals in Power BI” presentation from the Microsoft Business Applications Summit.

What’s to like
As with anything Power BI, Microsoft has democratized scorecards so business users with no reporting experience can quickly assemble them from existing reports. Think of a goal as a line (a KPI) in the scorecard. Here is a scorecard with two goals:

Currently, Power BI supports two goal types:

  • Static – The goal creator manually enters and tracks the goal properties, such as current value, target value, and status. This could be useful for quick-and-dirty KPIs that are not backed by a data source, such as launching a new promotion campaign. In the scorecard above, I created the Revenue goal by entering 5M as the current value and 5.5M as the goal.
  • Data-driven – The goal current value and/or target value can be data-driven and bound to metrics from existing report(s). Coming from Analysis Services, I was initially surprised that Power BI doesn’t require implementing KPI measures, but I get it: Microsoft decided to source the metrics from reports so business users can easily apply filters. If the goal owner chooses a metric from a visual that has a Date field, such as a time series chart, Power BI automatically shows a sparkline for the goal progress over time. And, of course, when the report dataset is refreshed, the goal values are updated.

So, no modeling or Power BI Desktop is required, assuming that someone else, such as a data analyst, has delivered functional reports with the metrics. Even better, the goal current and target values can come from different reports (even a report in a different workspace if you have permissions), e.g. a report with actuals and another report with targets. So, there is plenty of flexibility here. To mimic a balanced scorecard that spans multiple subject areas, the owner can create subgoals. For example, the main goal could be Finance with subgoals Revenue, Margin, etc. Because, like dashboard tiles, goals are “pinned” from reports, the end user can navigate to the underlying report to examine the data in more detail. Users can also add notes to explain the goal behavior to their teammates.

A scorecard is a first-class Power BI citizen, and as such, it can be secured, endorsed, protected with sensitivity labels, annotated, and shared, such as sharing the scorecard to a Microsoft Teams channel. The scorecard data is saved in a Power BI dataset that users can connect to in order to build custom reports. Moreover, Power BI automatically adds daily snapshots to the dataset, allowing users to build up a history of the goals. For example, if the underlying report is refreshed daily, the updated goal values will be appended to the dataset. Developers can use the Power BI REST APIs to implement programmatic scorecard management solutions.

What’s not to like

Besides navigating to the underlying report, a goal is a one-liner in the scorecard. I can’t define a goal that shows me a metric sliced by dimension members, such as business unit. Further, subgoals are not currently aggregable, such as to sum or average values when rolling up to the main goal. Like limitations with dashboards, there is no way to apply a global filter to the scorecard, e.g. to filter all goals for the prior month.

Besides current and target values, no other goal properties can be data driven. For example, unlike Analysis Services KPIs, the goal status can’t be currently bound to a DAX measure. Changing the status requires proactive manual “check ins” although Microsoft mentions a forthcoming feature that will let users define rules to change the status, like how you can define rules for dashboard tile alerts. Speaking of data-driven properties, I don’t understand why you must use a date field to get the progress as opposed to any other field, such as Month, in your Date table.

Finally and unfortunately, Goals require Power BI Premium. If we really want to democratize features, shouldn’t we make them available in Pro?

Conclusion

Goals are a Power BI Premium feature aimed at making it easier to create scorecards and monitor metrics from existing reports. They promote a “bottom-up” culture, where business users can create departmental scorecards without reliance on IT. Microsoft plans more features by the end of the year to make Goals more appealing, such as integration with Power Automate to trigger actions, rolling up subgoals, changing the goal tracking cycle (DoD, MoM, YoY), custom goal formatting, a Power BI Mobile experience optimized for phones, a scorecard visual, and cascaded goals (a hierarchy of goals).

If you find Power BI Goals somewhat inflexible or you don’t have budget to upgrade to Premium, you don’t have to use the Goals feature to implement scorecards. You can define KPIs and create dashboard-looking reports where you have complete control over the scorecard presentation.


Teo Lachev
Prologika, LLC | Making Sense of Data
Microsoft Partner | Gold Data Analytics


Prologika Newsletter Spring 2021


Amidst the COVID pandemic, the Houston Health Department (HHD) had another predicament to tackle. With lab results accumulating rapidly at one million cases per month, the vendor system they used for capturing and analyzing COVID data couldn’t keep up. In this newsletter, you’ll learn how Prologika implemented a BI solution powered by SQL Server and Power BI to solve technology challenges, and deliver fast and reliable insights.

Business Challenges

The vendor SQL Server database had large tables with normalized name-value pairs for each question and answer received from the patient, and for each investigation result. To facilitate reporting, the vendor system had scheduled ETL processes to extract data from the OLTP tables into denormalized tables. However, locking conflicts and large data volumes would crash the system and fail the ETL processes.

As a result, business analysts were not able to get timely data for reporting. HHD and the vendor tried to isolate the reporting loads by replicating the data to a reporting database, but the issue with populating the denormalized tables remained.

Solution

A good solution starts with a solid foundation. After assessing the current state and objectives, Prologika recommended and implemented the following architecture:

The stand-alone SQL database was replaced with an AlwaysOn availability group. Not only did this provide high availability, but it also isolated operational from reporting workloads.
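
As a side note, and only as a sketch with hypothetical availability group and replica names, a secondary replica can be configured to accept read-only connections so that reporting tools connect with ApplicationIntent=ReadOnly:

ALTER AVAILABILITY GROUP [AG_HHD]
MODIFY REPLICA ON N'SQLNODE2'
WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY));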

In the client’s own words “we have compared the cluster server to the report server and cluster is vastly superior with regard to performance for regular queries. One simple run was 4x faster on cluster than the current report server. A much more complex run took four minutes on cluster and I stopped the same run on the report server after 87 minutes.”

Previously, data analysts would produce reports using different tools, ranging from SQL and Python to Power BI reports. With scarce resources, HHD found it difficult to continue down that path. Instead, Prologika implemented a semantic model hosted in Power BI.

Benefits

The new architecture and semantic model delivered the following benefits to HHD:

  • A single version of truth – Strategic data and business calculations were centralized in one place.
  • Fast insights – The Power BI reports crunching millions of rows were instantaneous.
  • Isolation – Report loads were isolated from the operational loads because the data was cached in the semantic model.
  • Standardization and centralization – Power BI became the reporting platform for delivering insights across the organization.

 


Teo Lachev
Prologika, LLC | Making Sense of Data
Microsoft Partner | Gold Data Analytics


State Health Department Gains Reliable and Rapid COVID Insights

Amidst the pandemic, the Houston Health Department (HHD) had another predicament to tackle. With lab results accumulating rapidly at one million cases per month, the vendor system they used for capturing and analyzing COVID data couldn’t keep up. The SQL Server database had large tables with normalized name-value pairs for each question and answer received from the patient, and for each investigation result. Read our case study to learn how Prologika implemented a BI solution powered by SQL Server and Power BI to help HHD gain reliable and timely insights from COVID lab results.

SQL Server 2019 Installation Woes

Now that SQL Server 2019 is officially here, I was eager to try it out. My upgrade/reinstall experience ran into several issues that I thought might be worth sharing:

  1. Upgrade/reinstall fails with “An error occurred for a dependency of the feature causing the setup process for the feature to fail”, with no indication of which dependency failed. I solved it by extracting the ISO file into a folder instead of mounting it.
  2. Analysis Services Tabular fails to start with “An error occurred when loading the ‘ASSP’, from the file, ‘\\?\D:\MSSQLSERVER\SSAS\Data\ASSP.0.asm.xml’.” This error was caused by changing the Analysis Services Tabular default folders to another drive. It could be related to my setup, such as permissions granted to that drive. I fixed it by leaving the SQL Server and Analysis Services default folders.
  3. The PolyBase services can’t start and remain perpetually in a “Starting” state in the Windows Services applet. Consequently, the SQL Server Database Engine can’t be stopped. This was solved by enabling the TCP/IP protocol in the SQL Server Configuration Manager.
  4. Still an open issue although not a blocker – Any right-click action in SSMS connected to Analysis Services Tabular results in a significant delay.

Presenting at SQL Saturday Atlanta on May 18

SQL Saturday Atlanta next Saturday might break the attendance record with 900+ people registered! Kudos to the organizers. It’s not easy to organize an event of this size with a modest budget. I’ll present “What’s New for BI in SQL Server 2019” at 1 PM.

“SQL Server 2019 delivers important data analytics features and enhancements. Join this session to discover what BI features are coming up in Database Engine, SSIS, SSAS, SSRS (Power BI Report Server), and MDS, and to see some of these preview features in action.”

Taking the Most Out of Linked Servers

What’s the first thing that comes to mind when you hear “linked server”? Performance issues or a snail image, right? Many developers don’t like linked servers because they are “slow”, although they rarely bother to investigate the underlying issue, much less fix it.

Linked servers are a very useful and commonly used feature of SQL Server. They allow SQL Server to query external data sources, such as Oracle and DB2, and to present these data sources as SQL Servers to upstream systems. A query that involves a linked server is called a distributed query. The most common issue with a distributed query is that it doesn’t “remote”. In other words, while you expect the entire query (or parts of it) to be executed on the remote server, SQL Server downloads the data locally and then performs filters or joins. This can be very expensive with large tables, hence the distributed query notoriety. Conor Cunningham (Partner Software Architect, SQL Server Engine at Microsoft) covered this topic very well in his “Distributed Query Deep Dive” presentation a while back. I just want to add a few tips and tricks harvested from a recent project.

ETL developers typically implement incremental extraction to reduce data movement between systems. It’s a best practice. For example, when loading sales transactions, the query might ask for data since the last date the data was captured in the data warehouse. Here is what the distributed query might look like (note the four-part name notation):

select <columns> from [<linkedserver>].[<database>].[<schema>].[<table>] WHERE InvoiceDate >= '2018-10-20'

Unfortunately, as simple as this query is, it won’t remote, as you can witness by the performance degradation and the query plan (the screenshot below uses Live Query Statistics):

Starting from right to left, we can see that the remote query fetches all rows from the target table before it applies the filter. In fact, unless you tweak the linked server properties, this query won’t remote if it filters on any type of column, not just date. However, you might be able to get a query with filters on other data types (except date) to remote if you set Collation Compatible to True on the linked server properties in SSMS, assuming of course that SQL Server and remote server use the same collation.

EXEC master.dbo.sp_serveroption @server = N'<linkedserver>', @optname = N'collation compatible', @optvalue = N'true'

What’s so special about dates that prevents SQL Server from remoting them? Historically, date remoting was not done because of potential issues like different time zones and clock skews between the two servers. Unfortunately, SQL Server doesn’t give the developer the latitude to ignore such issues and force a distributed query to remote (I’ve tried many permutations of passing the date parameter with no luck). You can check the OLE DB Provider properties by enabling the OLEDB Provider Information event in the SQL Server Profiler to see if the provider implements SQLPROP_DATELITERALS. In my case, both the Oracle OLE DB Provider and Microsoft OLEDB Provider for DB2 didn’t implement this feature and I discovered this by the following property in the event results:

<DateLiteralsSupport>0</DateLiteralsSupport>

So, the only way to remote the query was to convert it from a distributed query to a pass-through query, using either OPENQUERY or EXEC AT.

select <column list> FROM OPENQUERY(<linkedserver>, 'SELECT * from <database>.<schema>.<table> where INVDT >= ''2018-10-20''')

EXEC ('SELECT TOP 10 * FROM <database>.<schema>.<table> where INVDT >= ?', '2018-10-20') AT <linkedserver>

The advantage of the EXEC AT method is that the query can be longer than 8,000 characters and you don’t have to concatenate parameters. This method requires that you first set the RPC and RPC Out properties of the linked server to True.

exec sp_serveroption @server = '<linkedserver>', @optname = 'rpc', @optvalue = 'true'

exec sp_serveroption @server = '<linkedserver>', @optname = 'rpc out', @optvalue = 'true'

This blog discussed some reasons why SQL Server might not remote even simple distributed queries. I recommend that Microsoft make linked servers more flexible to support such common scenarios. It will be interesting to see how the forthcoming SQL Server 2019 external tables that use PolyBase will play out. They are expected to be optimized more for throughput than for OLTP operations, so for OLAP queries we may get better performance compared to linked servers.