Power BI What-if Analysis

Veteran Excel users might have used the Excel What-If Analysis feature, which lets you try several different sets of values in one or more formulas to explore various outcomes (scenarios). The August release of Power BI Desktop introduces a similar feature. You start by defining a What If parameter from the Modeling tab.

Don’t confuse the What If parameter with a query parameter, which drives query properties, such as changing the data source the query connects to.

In the process of configuring the parameter, you define hardcoded minimum, maximum, and increment values. The outcome is two DAX formulas: a calculated table that generates the scale and a measure that returns the selected value.

Discount Percentage = GENERATESERIES(0, 0.5, 0.05)

Discount Percentage Value = SELECTEDVALUE('Discount Percentage'[Discount Percentage])

 

Next, you can use the “Value” measure like any DAX measure. Typically, you would create a target measure that somehow depends on the What If parameter, e.g.:

Sales After Discount = SUM(InternetSales[SalesAmount]) - (SUM(InternetSales[SalesAmount]) * [Discount Percentage Value])

Now you can see how the target value changes when you change the What If value.

Power BI What If parameters allow you to parameterize your DAX measures. Continuing on the same path, I’d like to see expression-based properties and filters, as well as the ability to parameterize fields on the report, such as presenting the user with a list of measures and rebinding visualizations after the user selects which measure they want to see on the report.

Power BI vs. Pyramid

I never thought I’d see this one coming, but today I got a marketing email from Pyramid Analytics (previously an ardent Microsoft partner) citing a head-to-head comparison with Power BI. Et tu, Brute?

“Let’s start with a clear pricing model – one that doesn’t require an online calculator to figure out annual costs. Let’s add in fundamental capabilities and features like dashboarding, KPI’s, dicing, time intelligence, parameterization, and asymmetric reporting. Let’s close out with a complete lack of narrative reporting capabilities. The Pyramid Analytics platform provides an enterprise class on-premises BI solution that delivers all of the above plus simple advanced analytics. It’s available now and at a price you can easily understand. There’s a clear choice. Click here for a head-to-head comparison and a limited time offer”

I pity any vendor that competes with Power BI, especially the ones that compete “head-to-head”. Pyramid must feel the heat to come up with this “The Choice is Clear” marketing banner. But to me it’s not very clear and, to keep ’em honest, their costing comparison is fabricated. According to Pyramid, you have to cough up $66,000/year for 100 users. But wait, the asterisk below tells us “*If data becomes too large for a single P1 node, additional nodes can be purchased for an additional $60,000/yr”. The choice is clear, right?

[Figure: Pyramid’s “The Choice is Clear” pricing comparison]

Actually, it’s not, and you don’t need the online calculator. It’s true that if you go with Power BI Premium, you’d pay $66,000/year, but why would you do that for 100 users (or 50 viewers, to be more accurate)? If you choose instead the per-user pricing model ($10/user/mo), your cost would be $12,000 per year, not counting any additional discount you might get from Microsoft. So, that’s $3,000 lower than Pyramid. And you’ll get a cloud solution that improves every month, plus the free Power BI Desktop tool. Or, if you want to save even more, you can go on premises by deploying to Power BI Report Server, which might be covered by an existing SQL Server Enterprise Edition license. Then your software cost might be zero, plus one or more Power BI Pro $10/mo licenses for report authors.

Aside from the Power BI Premium-only features, from a pure costing perspective switching to Power BI Premium makes sense above 500 users. It’s simple math.
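Here is that math, assuming list prices of roughly $5,000/month for a P1 node and $10/user/month for Power BI Pro (any negotiated discounts shift the breakpoint):

$$\text{breakeven} = \frac{\$5{,}000/\text{mo (P1)}}{\$10/\text{user/mo (Pro)}} = 500 \text{ users}$$

Below roughly 500 users, per-user Pro licensing is cheaper; above it, Premium starts to win.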

As for the second “Head-to-Head Comparison” page, which shows that Pyramid outshines Power BI in every possible way, it doesn’t mean much without substance. I’m looking forward to the extended banner with more details that explain how Pyramid exceeds the Power BI dashboarding, KPI, dicing, time intelligence, parameterization, and asymmetric reporting (whatever this means) capabilities. As for the “complete lack of narrative reporting capabilities”, this is not true either. Power BI has a Narrative Science custom visual, and Pyramid should know this.

Creating PBI Detail Reports

Related to the previous blog, let’s show how to create a detail report in Power BI that mimics an Excel PivotTable report in a flattened layout. Consider the report below. To create it, use the Matrix visual and add the necessary fields to the Rows bucket and measures to the Values bucket. By default, and similar to Excel, Matrix uses a Stepped layout so that each time you expand a field, the next field uses the same column to minimize horizontal space. In the Format tab, in the “Row headers” section, there is a “Stepped layout” setting that is easy to miss. When you switch it to Off, Matrix is configured to mimic the Excel Tabular (flattened) report layout. Now right-click each field on the report that you want to expand, and then click “Expand to next level”.

[Figure: Power BI Matrix detail report in a flattened layout]

There are currently two limitations that I hope will be lifted soon. There isn’t an option to expand all fields in one step, so you have to expand one field at a time. More importantly, there currently isn’t a way to disable specific totals. The “Total row” option in the Format tab toggles all subtotals on/off. Having report subtotals might lead to a performance issue (see my previous blog for more detail). UPDATE 8/14/2017 – The latter limitation has been lifted in the August release of Power BI Desktop, which now supports configuring row subtotals per level (turn “Per row level” to On in the Subtotals section of the Format tab).

Considerations for Detail Reports

Nobody likes watching a report spinner. Interactive detail reports that perform well from an Analysis Services semantic layer have been the bane of my BI career. A “detail report” is a report that requests data at a lower level, e.g. policy in the insurance business, customer in sales, etc. A detail report typically has many dimension attributes, e.g. Customer Name, Account Number, Product Name, Product Number, etc. And the more columns you add, the slower the report gets. The reason why such reports typically don’t perform well when generated from a semantic layer is that Analysis Services is not SQL Server.

Multidimensional is an attribute-based model, and the server cross joins the member values when you add attributes to the report. Don’t be misled by the “relational” nature of Tabular either. Its database engine (xVelocity) is an in-memory columnar database that still cross joins the column values. That said, Tabular should give you a performance boost for two reasons. First, its in-memory nature is generally faster than Multidimensional. Second, Excel has been optimized for Tabular, as I explain in my “Optimizing Distinct Count Excel Reports” blog, thanks to the undocumented PreferredQueryPatterns setting (although not listed in msmdsrv.ini, it defaults to 1).

In a recent project, I migrated a customer from Multidimensional to Tabular. Besides other benefits, such as eliminating snapshots, dramatically reducing ETL time, and analyzing data as of any date (not just at month end), the customer wanted to improve the performance of Excel detail reports. Previously, some of the detail reports would never return. After the migration, these reports execute within a minute, and in seconds if the Excel subtotals are disabled.

Here are some tips you might find useful if you’re tasked to produce detail reports:

  1. If the detail reports don’t require too much interactivity, consider implementing them as SSRS paginated reports that connect directly to the database. Chances are that this approach will give you the fastest performance, as the Database Engine is designed to retrieve data in sets by rows, and the number of columns probably won’t affect report performance (unless, of course, many joins are required).
  2. If you’d rather connect these reports to a semantic layer, consider Tabular because it’s better for wide and flat results.
  3. If Excel is used as a front end, consider Excel 2016 as Microsoft has made various performance improvements for detail reports.
  4. Consider disabling Excel subtotals, as explained here. In my scenario, a detail report with no subtotals would execute in under 15 seconds. However, if I enable just one column subtotal, Excel switches to a completely different query pattern (with many nested DrilldownMember levels) and the report query takes a minute. That’s because the query now needs to obtain the subtotals for each group from SSAS. Since we’re at the mercy of the Excel MDX query generator, I hope the Excel team finds a way to produce more optimal MDX queries. Ideally, a future Excel release would allow binding Excel native tables directly to SSAS with the ability to define subtotals and optimized queries to load the data.
  5. In my experience, Power BI reports that generate DAX don’t necessarily perform any better. To make things worse, while the new Power BI Matrix visual can be configured to a flattened layout, it doesn’t currently support disabling specific subtotals (currently, you can remove all row subtotals or have them for each and every column). And asking for subtotals for every column might not only contradict business requirements but could also severely affect performance. UPDATE 8/14/2017 – The August release of Power BI Desktop supports configuring row subtotals per level.

Many thanks to Akshai Mirchandani from the SSAS product group for not losing patience with my complaints on this subject throughout all these years.


MVP For 14 Years!

I got awarded Microsoft Most Valuable Professional (MVP) – Data Platform again. This makes 14 consecutive years of being recognized by Microsoft for my expertise and contributions to the community! Apparently, it became difficult (or not cost effective) to count years, so the ring is for 2017-2018, but the tower doesn’t lie.

What’s ROI for BI?

I often ask prospects if they’ve done any estimates of the value that the envisioned BI solution would bring them. The Microsoft Solution Framework actually recommends partners do this during the planning phase. Alas, the answer is almost always negative. Then I ask what criteria would be used to evaluate whether my price (I give fixed, value-based prices, as I don’t sell time) would yield a good ROI, and there is an awkward pause. Instead, the buyer typically uses a top-down approach for selecting a consultant: the buyer has some budget constraints and solicits multiple quotes to find the “best” fit. There is nothing wrong with this, of course, given that ROI for BI is notoriously difficult to determine.

Consultancies and advisory firms have some sort of a “seat-of-the-pants” methodology to help the customer come up with ROI if this is deemed important. But you might not have to go through this exercise, as research companies have done the math for you. For example, Nucleus Research has estimated that BI delivers 1300% ROI!

“In analyzing Nucleus ROI case studies on analytics, we found the average returns from analytics have been increasing, reaching $13.01 for every dollar spent in 2014 from just $10.66 in 2011.”

Often, the benefits of an effective and insightful BI solution can only be realized after the customer has started using the solution. I’m always delighted when customers report that our solution has paid for itself in the first year or so. For example, an insurance company found that customers underreport their risk exposure and justifiably increased premiums. Another client found that data analytics transformed the organization’s data into a key strategic business asset, empowering employees like never before.

What’s your ROI for BI?

Running ETL on AlwaysOn Primary Replica

Scenario: Consider a SQL Server configured for AlwaysOn Availability Groups, where a set of databases can fail over to another node. You have SSIS jobs, and you created them on both nodes. However, you need to check if the current node on which the job is running is the primary node. If you don’t, the job fails because the database isn’t accessible.

Solution: In every SQL Server Agent Job, add a first step to check if one of the replicated databases in the AlwaysOn availability group is the primary replica:

IF sys.fn_hadr_is_primary_replica('<replicated database name here>') = 0
BEGIN
    -- This node doesn't host the primary replica, so stop the job.
    EXEC msdb..sp_stop_job N'DW Daily Load';
END

This script checks whether the database is the primary replica on the current node. If it isn’t, the script stops the job (in this case, the “DW Daily Load” job) so the subsequent steps don’t run.
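If you have many jobs to patch, you can also insert the check programmatically with msdb.dbo.sp_add_jobstep. Here is a minimal sketch, assuming the job is named “DW Daily Load” as above, and using DW as an illustrative database name:

USE msdb;
GO
EXEC dbo.sp_add_jobstep
    @job_name = N'DW Daily Load',            -- existing job name (from above)
    @step_name = N'Check primary replica',
    @step_id = 1,                            -- insert as the first step; existing steps shift down
    @subsystem = N'TSQL',
    @command = N'IF sys.fn_hadr_is_primary_replica(''DW'') = 0  -- DW is an illustrative database name
BEGIN
    EXEC msdb..sp_stop_job N''DW Daily Load'';
END',
    @on_success_action = 3;                  -- 3 = go to the next step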

Power BI vs. Qlik

After Tableau and SiSense, Keeping ’em Honest continues with Qlik. Now, after they sold the company and took a hit in the latest Gartner quadrant, I didn’t expect much of a competition from Qlik, but every now and then I run into a customer considering Qlik and willing to share Qlik’s feedback on how Qlik outshines Power BI in every possible way (I can’t share these documents due to NDA, although they are interesting and entertaining stories). But to be fair to Power BI, I’ll enumerate next a few reasons that customers have shared for why they prefer Power BI over Qlik, drawn also from my experience in helping several customers transition to Power BI from Qlik. I’ll focus on Qlik Sense, which is Qlik’s latest tool that competes head to head with Power BI.

Chasing the perfect tool (hint: there is none)? Instead, focus on architecture and data quality. Tools come and go, and they tend to leapfrog each other. Data quality and a good foundation stay on.

Data Acquisition

  1. Connectors – The data journey starts with connecting to your data. According to Qlik’s documentation, it offers a very small set of connectors. By contrast, Power BI supports over 70 connectors, and the list is growing every month.
  2. Content packs – In Power BI, content packs delivered by Microsoft partners allow business users to get pre-packaged reports and dashboards from popular cloud services, such as Dynamics CRM, Salesforce, Google Analytics, Marketo, and many more, directly in Power BI Service (no need to use Power BI Desktop). No such feature exists in Qlik.
  3. Complexity – One prevailing theme I hear from Qlik users is complexity: the tool requires the end user to be a developer to create scripts and import data. In fact, I know of a large organization in Atlanta that has hired a consulting company to create and maintain Qlik reports. So much for the self-service BI story. By contrast, Power BI Service and Power BI Desktop include user-friendly wizards.
  4. Data transformation – In Power BI Desktop, a data analyst can use the Query Editor to define repeatable steps to transform and cleanse the data before it’s imported into the data model. The Query Editor (Power Query in Excel) resonates well with business users. No such tool exists in the Qlik world.

Data Model

  1. Associative data model – Qlik continues to tout the “associative data model” (learn more in this video and its related videos on YouTube). Besides the self-service vs. organizational BI propaganda, there’s nothing new here that Power BI can’t do. As far as the extolled virtues of the “associative data model”, I couldn’t agree more with what Donald Farmer (a former Vice President at Qlik) had to say a while back: “Talking of engines, some have been misled to believe that QlikView’s supposed “associative analysis” represents some significant engine smarts. I have even heard analysts very misleadingly say that QlikView has “association rules” – implying some kind of data mining, such as Microsoft implements in its Data Mining server and Excel Add-ins. QlikView add to the confusion by talking about an associative “architectural model.” However, despite the hype, as Curt Monash points out (or rather, painfully extracted from QlikTech themselves through a long thread of comments) it is not so: “The associative aspect is really more meaningful in describing the end user experience, in that you see visually what is associated and is not associated with any particular selection or drilldown.” As Curt says, “Thank you for admitting that clearly!!! It wastes a fair amount of analysts’ time when your company pretends otherwise.” So, the associative model is nothing more than the relationship auto-discovery in Power BI Desktop.
  2. Data model capabilities – I’ve said this many times and I’ll say it again: no other tool on the market comes even close to Power BI when it comes to modeling capabilities, which allow business users to create self-service data models that are on a par with models created by BI pros.
  3. Calculations – In Power BI, we use DAX, a language that is far more powerful than Qlik expressions.

Platform

  1. Ecosystem – No matter how good it is, a self-service visualization tool addresses only a small subset of data analytics needs. By contrast, Power BI is a part of the Microsoft Data Platform, which allows you to implement versatile solutions and use Power BI as a presentation layer. Want to implement a real-time dashboard from data streams? Azure Stream Analytics and IoT integrate with Power BI. Want to show reports on the desktop from natural questions? Cortana lets you do it by typing questions or by voice. Want to implement smart reports with predicted results? Power BI can integrate with Azure Machine Learning. Want to publish SSRS and Excel reports alongside interactive reports? Power BI supports this. Expect this strength to increase as Cortana Analytics Suite and prepackaged solutions evolve.
  2. Scalability and continuum – Despite Qlik’s scalability white papers that claim to scale to billions of rows, a desktop tool can get you up to a few million rows. Even if its technology can scale to much bigger data volumes, working with gigabyte files is no fun. Not to mention that you probably wouldn’t want millions of rows imported and exported all the time. At some point, you have to consider moving your models to a dedicated server. In MS BI, you can transition from Power BI Desktop to Tabular on the server. Same technology, same interfaces. In fact, I tell students that if you know Power BI Desktop, you already know 80% of Tabular. What’s Qlik’s continuum story?
  3. On your terms – Power BI lets you deploy your reports to the cloud (Power BI Service) and on premises (Power BI Report Server). Qlik is a predominantly on-premises solution with a fledgling cloud offering. Besides, Qlik Sense Cloud is more expensive at $25/mo vs. $10/mo for Power BI (if you decide to go cloud).

Analytics

  1. Natural language – There is no equivalent to Power BI Q&A, which allows business users to ask natural questions. Qlik has basic storytelling via Narrative Science, and so does Power BI.
  2. Advanced analytics – No integration with R or built-in machine learning capabilities, such as Quick Insights in Power BI.
  3. Real-time dashboards – Qlik doesn’t have an API to allow developers to push data to a dashboard that updates in real time.
  4. Geospatial reporting – Qlik’s geospatial reporting is not as robust as Power BI’s. For example, Power BI offers at least four mapping visualizations, including Shape Map (supports custom shapes) and ArcGIS.

Openness

  1. Reporting tools – Reporting tools come and go. If you make a decision to go with Qlik, can you use another reporting tool to connect to it? By contrast, with Power BI you can analyze your data with any DAX- or MDX-capable tool, such as Excel.
  2. Custom visuals – Qlik claims that it can integrate with D3.js to allow developers to create custom visuals. In reality, Qlik extensions basically spit out HTML. By contrast, Power BI custom visuals adhere to predefined interfaces. What this means for a developer is that it’s much easier to implement and test custom visuals in Power BI.

Other

  1. Cost – I’ll let you do your own math here but the number one reason customers gravitate to Power BI is cost.
  2. Speed – Power BI gets new features every month. How often is Qlik updated?


Prologika Newsletter Summer 2017

What Does Power BI Premium Mean for You?


As I mentioned in my Spring newsletter “15 Reasons Why Microsoft is Leading the BI Space”, Microsoft BI is enjoying resurgent interest from customers and industry observers, and Power BI gets lots of attention. I’m sure that you’ve also heard the announcements about Power BI Premium – a new dedicated licensing model for Power BI. In fact, Power BI Premium is so important to Microsoft that it’s positioned as a new product under the Power BI umbrella name instead of a new licensing model. In this newsletter, I’ll discuss what Power BI Premium means for you and why you should strongly consider it. If you live near one of these cities, consider joining me at SQL Saturday Chattanooga on June 24th or the Atlanta MS BI Group on June 26th, when I’ll present “Understanding Power BI Premium” and discuss its inner workings in more detail. My June calendar has the details about these events.


Why Power BI Premium?

According to Microsoft, in less than two years since Power BI became generally available, it has attracted more than 5 million users. As the product matures, the following growing pain points have emerged:

  • Per-user licensing model – Large organizations find it difficult to license Power BI per user. A case in point: one year after a successful Power BI hybrid pilot, a Fortune 100 organization had purchased a whopping 5 Power BI Pro licenses. There are several reasons for the slow adoption by large companies, but one of them is the per-user license.
  • Cost prohibitive with a large user base – The per-user licensing model doesn’t include a “reader” license. If a report uses Power BI Pro features, all users accessing it need Power BI Pro licenses. So, if you had a report that used Power BI Pro features, such as gateways or live connections, and you wanted this report to be available to 1,000 users, you had to foot a $10,000/month bill because every user required Power BI Pro.
  • Performance – Power BI Service is a shared cloud environment. Although no one has shared performance issues with me, it’s possible that the performance of your reports might be negatively affected by other organizations.
  • No “premium” features – When I teach Power BI, I often get feedback that some missing features, such as caching and incremental refresh, are important. This goes back to performance: the reason they are not currently available is that they might impact performance in a shared environment.
  • Confusing licensing – Previously, there was some overlap between Power BI Free and Power BI Pro. Many organizations found it very difficult to navigate which features are available where.

What’s There to Like in Power BI Premium?

There are several reasons why Power BI Premium might be appealing to your organization.

To start with, larger organizations can save on licensing cost. If all users need access to reports with Pro features, the breakpoint at which Power BI Premium reduces cost is above 500 users. That’s because Power BI Premium allows Power BI Free users to access Pro content without Power BI Pro licenses. You can use the Power BI Premium calculator to experiment with what-if scenarios and calculate your licensing cost.

Cost savings is not the only reason to consider Power BI Premium. You also get dedicated capacity. When you purchase Power BI Premium, you buy a dedicated environment for your organization. Think of it as buying a dedicated IaaS service, such as an Azure VM, versus a PaaS service, such as an Azure SQL Database, that is shared with other users on the same server. For more information about how this works and what Premium plans are available, read the Power BI Premium whitepaper.

Almost every organization I’ve consulted with has the need to embed reports for a third party. Previously, if you wanted to embed Power BI reports, you had to use the Azure Power BI Embedded Service, which was separate from the Power BI Service. Consequently, Power BI Embedded had to catch up with Power BI Service. For example, it still doesn’t have connectivity to on-premises data sources. The good news is that Power BI Embedded marries Power BI Service (Power BI Premium = Power BI Service + Power BI Embedded), so there will be feature parity. If you are concerned about the increased licensing cost (Power BI Embedded has an hourly-based session licensing model), Microsoft unveiled low-cost Power BI Premium plans targeted at report embedding only, starting at $625/month.

[Figure: Power BI Premium embedding plans]

Another appealing detail concerning embedded analytics is that, unlike Power BI Premium, which requires report authors to have Power BI Pro licenses, creating new embedded reports or changing existing embedded reports by external users doesn’t require a Power BI Pro license. Of course, if your organization purchases a Power BI Premium plan, such as P1 starting at $5,000/month, not only will you be able to use Power BI Premium for internal use, but you can also embed reports for external customers. In other words, one Power BI Premium license covers both internal and external reporting.

Finally, Power BI Premium-only features are in the works, such as real-time caching by pinning datasets to memory, incremental dataset refresh, secondary model copies for disaster recovery, and geo replicas to reduce latency by replicating instances around the world.

Changes to the Power BI Portfolio

You might wonder how Power BI Premium impacts your current Power BI usage. The table below shows how Power BI Premium changes the Power BI portfolio.

Power BI Desktop
  Today: Connect to 70+ data sources; data transformations; report creation and exploration
  Future: No changes

Power BI Free
  Today: No live connections and no gateway connectivity; smaller capacity limits and data refresh rates; only simple dashboard sharing
  Future: Power BI Pro feature parity; no sharing (not even simple dashboard sharing)

Power BI Pro
  Today: Access to all data sources; larger capacity limits and data refresh rates; all sharing options (simple, workspaces, org content packs)
  Future: No changes

Power BI Premium (new)
  Future: Increased capacity limits; dedicated environment; content distribution (reader license); Power BI Report Server; more features in the future (read the whitepaper)

So, don’t panic! You can continue using Power BI under the current per-user licensing model (Power BI Free and Power BI Pro) if Power BI Premium makes no sense to you. More good news is that Power BI Free now has all the Power BI Pro features, so hopefully we won’t see “This report requires a Power BI Pro license” anymore. The bad news is that with per-user licensing, Power BI Free is for individual use only. A Power BI Free user can’t share any Power BI content (not even simple dashboard sharing). And, unless you purchase Power BI Premium, content can’t be shared with Power BI Free users (neither inbound nor outbound sharing is available for Power BI Free users).

Power BI Report Server

Delivering on its on-premises reporting roadmap, Microsoft has extended SQL Server Reporting Services (SSRS) to support Power BI reports. The extended SSRS will be known as Power BI Report Server. Moreover, Microsoft has decoupled SSRS from SQL Server so it gets more frequent updates. Moving forward, SSRS becomes two products:

  • SSRS – This is the SSRS we know, but with no Power BI integration. It will get new RDL features but no Power BI integration features.
  • Power BI Report Server – It gets both RDL and Power BI features. As for the reason for the name change, the Power BI name is a strong brand, while SSRS has been associated with old-style paginated reports.

You can get Power BI Report Server in two ways:

  • As a part of a Power BI Premium plan, where you get the same number of licensed on-premises cores as the v-cores you purchased within your Power BI Premium plan. You can use these licensed cores only for Power BI Report Server (SQL Server requires a separate license if you install it on the same box as Power BI Report Server).
  • Standalone, covered by a SQL Server Enterprise Edition with Software Assurance license. Currently, SQL Server doesn’t check for Software Assurance in any way (there isn’t such a SKU). So, Power BI Report Server licensing would be an honor system for customers who want to get it standalone (with SQL Server Enterprise Edition licenses).

So, although Power BI Report Server has divorced SQL Server, you can still cover it with a SQL Server license (kind of like when you send your kid to college but she still lives with you). In both cases, developers publishing Power BI reports to the Power BI Report Server will require a Power BI Pro license at $9.99/user/mo.

MS BI Events in Atlanta

As you’d probably agree, the BI landscape is fast-moving and can be overwhelming. If you need any help with planning and implementing your next-generation BI solution, don’t hesitate to contact me. As a Microsoft Gold Partner and premier BI firm, Prologika can help you plan and implement your data analytics projects, and you can rest assured that you’ll get the best service. Regards,

Teo Lachev
President and Owner
Prologika, LLC | Making Sense of Data
Microsoft Partner | Gold Data Analytics

Scaling out SSIS in SQL Server 2017

SQL Server 2017 introduces the ability to scale out SSIS. The primary scenario is to enable customers to scale out their SSIS execution at the package level. Imagine a retail business that has about 300 packages to run at the end of the day to push transactional and finance data from multiple stores to the centralized data warehouse for reporting. All 300 packages need to run at night, but they also need to finish by 6 AM, before the new business day starts. As the business grows, data size grows, it takes longer to run the 300 packages, and it becomes more difficult for the ETL processing to finish before 6 AM. With the SSIS Scale Out feature, you can now set up a scale-out cluster with one master and N workers, so that the master automatically distributes the packages to the workers based on worker availability. This parallelizes execution at the package level and achieves faster completion times. You don’t have to worry about which packages will be executed by which worker, as that’s handled by the master. All you need to do is deploy the 300 packages to the catalog and then trigger the ETL run.
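For example, here is a minimal sketch (the folder, project, and package names are illustrative) of triggering a Scale Out execution through the SSISDB catalog, using the @runinscaleout parameter that SQL Server 2017 adds to catalog.create_execution:

DECLARE @execution_id BIGINT;

EXEC SSISDB.catalog.create_execution
    @folder_name = N'ETL',                  -- illustrative folder name
    @project_name = N'DailyLoad',           -- illustrative project name
    @package_name = N'LoadStoreSales.dtsx', -- illustrative package name
    @runinscaleout = 1,                     -- let the master assign an available worker
    @execution_id = @execution_id OUTPUT;

EXEC SSISDB.catalog.start_execution @execution_id;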

Sounds like an easy fix to your long ETL processing times? As with any technology, my advice is to focus on improving the performance of your ETL first, then scale up, then scale out. Here are the top 5 performance issues that I see over and over in my consulting practice:

  1. No parallelism – Packages run sequentially, e.g. dimensions are populated one by one and then fact tables one by one. Meanwhile, despite all the CPU power and cores you fought hard to secure, the server isn’t doing much. Instead, you should run things in parallel as much as you can, with the goal of saturating the server’s CPU bandwidth. Ideally, you should invest in a framework that can automatically distribute work across packages with a configurable degree of parallelism, such as the one we use, mentioned in my newsletter “Is ETL (E)ating (T)hou (L)ive?”.
  2. No incremental extraction – The less data you process, the faster your ETL will be. You should strive to extract only the rows that have changed if possible, using techniques such as a LastUpdated timestamp, temporal tables, or CDC (Change Data Capture).
  3. ETL instead of ELT pattern – Influenced by Microsoft and expert advice, you’ve decided to use the data flow transforms. For example, in a recent ETL review, I found that a Lookup task caches all 50 million rows from a fact table in memory to determine whether a row should be inserted or updated! What happens when (not if) one day the fact table swells to 500 million rows? Instead, the ELT pattern that relies on stored procedures and T-SQL MERGE would almost always be a better choice from a performance standpoint (see the sketch after this list). Thirty years went into improving and evolving Microsoft SQL Server. It would be naïve not to take advantage of this evolution and its set-based processing. And as a bonus, if one day you decide to migrate your DW to the cloud, such as by using Azure SQL Data Warehouse, you’ll find that ELT is recommended (unfortunately, Azure SQL DW still doesn’t support MERGE, but one day I hope it will).
  4. Query optimization – Needless to say, your queries should be optimized. In the same recent review, I found one query that extracts data from the ODS and takes four hours to execute!
  5. Excessive data movement – This goes back to incremental extraction, but I often see ETL that does a full load or partial load (e.g. the last six months) and copies millions of rows from source to staging table 1, staging table 2, …, and finally the data warehouse.
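To illustrate the ELT pattern from tip 3, here is a minimal T-SQL MERGE sketch (the table and column names are illustrative, not from the project above) that replaces an SSIS Lookup-based upsert with a single set-based statement:

MERGE dbo.DimCustomer AS target
USING staging.Customer AS source
    ON target.CustomerKey = source.CustomerKey
WHEN MATCHED AND target.RowHash <> source.RowHash THEN
    -- The row exists but has changed, so update it.
    UPDATE SET target.CustomerName = source.CustomerName,
               target.City = source.City,
               target.RowHash = source.RowHash
WHEN NOT MATCHED BY TARGET THEN
    -- The row is new, so insert it.
    INSERT (CustomerKey, CustomerName, City, RowHash)
    VALUES (source.CustomerKey, source.CustomerName, source.City, source.RowHash);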

If, after you follow the above best practices, you still find that you’re exceeding your ETL processing windows, you should consider scaling out SSIS. Before doing so, consider the following limitations in the SSIS scale-out feature in SQL Server 2017:

  • A best practice is to partition your ETL process into child packages that are orchestrated by a master package using the Execute Package Task (EPT). SSIS Scale Out can run packages with EPT, but it does not scale out the EPT package executions to another machine. If you have a main package that has various sub-package executions with EPT, the package and all its EPT sub-packages will run on the same worker machine, while other workers in your scale-out cluster can handle other packages that are independent. If some order/dependency is needed, extra work is required on your part to design a master package on purpose (i.e. one master package to control the overall flow, trigger executions in Scale Out, and wait for the results accordingly).
  • If you opt for the ELT pattern, as I suggested, scaling out might not help all that much. Your gain will depend on where the performance bottleneck is. If the bottleneck is already in the database, scaling out ETL across multiple nodes will be a futile effort. But if the bottleneck is not the database, then scaling out can still help you take advantage of multiple machines.
  • When a package is scheduled in Scale Out, by default it can be assigned to any worker node for execution. But if needed, you can also specify which worker node(s) the package execution should be assigned to when triggering the execution, as the sketch below shows. The master node doesn’t monitor worker utilization. Instead, each worker node monitors its own CPU and memory and tells the master whether it can take more packages.
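Building on the earlier execution sketch, here is a minimal example of pinning an execution to a specific worker. The object names are again illustrative, and the catalog.worker_agents view and catalog.add_execution_worker procedure are per the SQL Server 2017 SSISDB catalog, so verify them on your build:

DECLARE @execution_id BIGINT, @worker_id UNIQUEIDENTIFIER;

-- Pick an enabled worker registered with the master.
SELECT TOP (1) @worker_id = WorkerAgentId
FROM SSISDB.catalog.worker_agents
WHERE IsEnabled = 1;

EXEC SSISDB.catalog.create_execution
    @folder_name = N'ETL', @project_name = N'DailyLoad',   -- illustrative names
    @package_name = N'LoadStoreSales.dtsx',
    @runinscaleout = 1,
    @execution_id = @execution_id OUTPUT;

-- Restrict this execution to the chosen worker.
EXEC SSISDB.catalog.add_execution_worker @execution_id, @worker_id;

EXEC SSISDB.catalog.start_execution @execution_id;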