Posts

High Memory Usage and Calculated Columns

Scenario: A client reports a memory spike during processing. They have a Tabular semantic model deployed to Azure Analysis Services, and they fully process the model daily. The model normally takes less than 50 GB of RAM, but during processing memory spikes to five times that amount and Azure Analysis Services terminates the processing task, complaining that it “reached the maximum allowable memory in our pricing tier”. Normally, fully processing a model should require about twice its memory footprint, so why five times?

Solution: Upon inspecting the model design, I discovered that the client had decided to add (many) calculated columns to the two fact tables in the model. Most of these columns are used to calculate variances to the prior year, and the formulas contain DATESYTD and other DAX date-related functions. After data is read, Tabular processes calculated columns, relationships, and hierarchies. In this case, the spike was due to calculations involving large time ranges and inefficient DAX expressions. Converting these columns to measures resolved the issue.

As a best practice, abstain from using calculated columns, especially in fact tables. Make sure you understand the difference between measures and calculated columns (I cover this extensively in my latest book “Applied DAX with Power BI“). If you do need expression-based columns, such as to materialize expensive calculations, consider defining them upstream, such as in SQL views or Power Query.
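For illustration, here is a minimal sketch (the table, column, and measure names are hypothetical) of how such variance-to-prior-year logic can be expressed as measures, which are evaluated only at query time instead of being materialized for every fact row during processing:

-- Base measure over the fact table
[Sales Amount] = SUM ( FactSales[Sales Amount] )

-- Time-intelligence measures evaluated in the query filter context
[Sales Amount YTD] = CALCULATE ( [Sales Amount], DATESYTD ( 'Date'[Date] ) )

[Sales Amount PY YTD] = CALCULATE ( [Sales Amount YTD], SAMEPERIODLASTYEAR ( 'Date'[Date] ) )

-- Variance to prior year, built on the measures above
[Sales Amount YTD Var PY] = [Sales Amount YTD] - [Sales Amount PY YTD]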

Prologika Newsletter Summer 2019

After learning how to model the data properly (the most important skill), DAX would be the next hurdle in your self-service or organizational BI journey. You won’t get far in Microsoft BI without DAX. This letter shares a tip on how to identify DAX performance bottlenecks by using a recent feature in Power BI called Performance Analyzer.

Learning DAX

Before I get to Performance Analyzer, I’m excited to announce my latest book: Applied DAX with Power BI: From zero to hero with 15-minute lessons. This book was born out of necessity. In my consulting practice, I had been teaching and implementing Power BI and Analysis Services Tabular, and people were constantly asking for DAX book recommendations. Indeed, DAX is not an easy topic and has its ways to humble even experienced practitioners. There are a few good reference books out there, but they could be somewhat overwhelming for novice users. So, I turned my classroom and consulting experience into this book and designed it as a self-paced guide to help you learn DAX one lesson at a time.

This one was a different book to write. Unlike my other books, which tend to be heavy, this book is smaller (220 pages). Regardless, it took me a while to write it. The book also adopts a lesson-based approach, so the reader can learn and practice a specific DAX concept quickly. There are usually three sections in each lesson: Practice (with step-by-step guidance), Outcome (a visual that shows the result), and Analysis (explaining why DAX works this way). Most lessons are five to six pages long, and it should take no more than 15 minutes to complete the lesson’s exercises (you can download and view a sample lesson from the book page).

And if you do one lesson per day, you’ll be a DAX expert in a month!

Analyzing DAX Queries

“This report is slow!” I hope you never hear this, but chances are that you will sooner or later. No one likes watching a spinning progress indicator while waiting for a report to show up. Tucked in the View ribbon of recent Power BI Desktop builds is a handy tool called Performance Analyzer that can help you identify slow queries. It’s easy to use:

  1. In Power BI Desktop, click the View ribbon and check the Performance Analyzer setting. This will open the Performance Analyzer pane.
  2. Click Start Recording in the Performance Analyzer pane. Once you start recording, any action that requires refreshing a visual, such as filtering or cross-highlighting, will populate the Performance Analyzer pane. You’ll see the statistics of each visual logged in the load order with its corresponding load duration.
  3. You can click the “Refresh visuals” link in Performance Analyzer to refresh all visuals on the page and capture all queries. However, once you are in recording mode, each visual gets a new icon that lets you refresh only that visual.


Performance Analyzer captures the following information:

  • DAX query – The length of time to execute the query.
  • Visual display – How long it took for the visual to render on the screen after the query is executed.
  • Other – This is the time that the visual spent in other tasks, such as preparing queries, waiting for other visuals to complete, or doing some other background processing.

Although not as comprehensive as DAX Studio, Performance Analyzer is a quick and easy way to get an overall idea of what impacts your report page performance. Performance Analyzer is currently available only in Power BI Desktop; it’s not available in the Power BI Service.


Teo Lachev
Prologika, LLC | Making Sense of Data
Microsoft Partner | Gold Data Analytics


“Applied DAX with Power BI” Book Available

I’m excited to announce my latest book: “Applied DAX with Power BI: From zero to hero with 15-minute lessons“. This book was born out of necessity. In my consulting practice, I had been teaching and implementing Power BI and Analysis Services Tabular, and people were constantly asking for DAX book recommendations. Indeed, DAX is not an easy topic and has its ways to humble even experienced practitioners. There are a few good reference books out there, but they could be somewhat overwhelming for novice users. So, I turned my classroom and consulting experience into this book and designed it as a self-paced guide to help you learn DAX one lesson at a time.

This one was a different book to write. Unlike my other books, which tend to be heavy, this book is smaller (220 pages). Regardless, it took me a while to write it. The book also adopts a lesson-based approach, so the reader can learn and practice a specific DAX concept quickly. There are usually three sections in each lesson: Practice (with step-by-step guidance), Outcome (a visual that shows the result), and Analysis (explaining why it works this way). Most lessons are five to six pages long, and it should take no more than 15 minutes to complete the lesson’s exercises (you can download and view a sample lesson from the book page).

And if you do one lesson per day, you’ll be a DAX expert in a month!

Currently, the book is available on Amazon (paperback and Kindle eBook) and it’s making its way to other outlets. As always, I’m awaiting feedback on the discussion forum at the bottom of the book page.


Fixing Random Problems with DAX

Scenario: You have a dimension table and a fact table, and you want to generate some test data, such as by randomly looking up a value from the dimension table for each row in the fact table.

The following DAX formula for a calculated column on the fact table should get the job done, but RANDBETWEEN doesn’t work (although it works as a standalone function outside LOOKUPVALUE). Specifically, you get either the same value on every row or an empty value.

Rand_Value_From_Other_Table =
LOOKUPVALUE (
    Dim_Table[Fruit],
    Dim_Table[Fruit_ID],
    RANDBETWEEN ( 1, MAX ( Dim_Table[Fruit_ID] ) )
)

Solution: I think the issue is that, like Power Query, DAX assumes that functions are idempotent: given the same arguments, they should produce the same result, which isn’t true for RANDBETWEEN and Power Query’s Number.Random function. The simple solution is to change the first argument to perform a dummy operation that doesn’t reduce to a constant value, such as:

Rand_Value_From_Other_Table =
LOOKUPVALUE (
    Dim_Table[Fruit],
    Dim_Table[Fruit_ID],
    RANDBETWEEN ( 1 + Fact_table[Fact_table_id] - Fact_table[Fact_table_id], MAX ( Dim_Table[Fruit_ID] ) )
)

Thanks to Stacey Jones who came up with this interesting puzzle.

Implementing Input Table in DAX Query

Scenario: You want to pass an input table to a DAX query. For example, you might have an application that lets the user specify a subset of customers and an as-of date for each customer. Your query needs to calculate certain metrics, such as Sales, for the list of customers, but as of the user-specified date for each customer, and potentially with other parameters. It’s impractical to save these parameters in a physical table in your database because that requires a processing task. You may also want to avoid Analysis Services placing locks on the database for each refresh and thus blocking pending queries. Instead, you want to simply pass all the parameters the query needs to do its job inside the query itself.

Solution: DAX supports static tables with the DATATABLE function, but the resulting table is very limited in its support of features. First, you can’t name the table in your query, so you need to resort to using a variable. More importantly, many DAX operations that reference its columns, such as attempting to compute MAX over a column to get the “current” value, will error out with “Table variable ‘_filter’ cannot be used in current context because a base table is expected”. You also can’t create physical relationships to such a data table.

However, you can use TREATAS to establish virtual relationships based on the input parameters in the input table. The following query has a _filter variable that points to a custom data table that contains two customers with corresponding as-of dates. Then, the query uses TREATAS to evaluate the sales and order quantity for each customer as of the specified date. You can run this query against the AdventureWorks Tabular Model that comes with the SSAS samples.

DEFINE
    VAR _filter =
        DATATABLE (
            "Customer Id", STRING,
            "Date", DATETIME,
            { { "AW00011000", "7/22/2007" }, { "AW00011001", "7/18/2005" } }
        )
EVALUATE
ADDCOLUMNS (
    TREATAS ( _filter, Customer[Customer Id], 'Date'[Date] ),
    "Sales", CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ) ),
    "Quantity", CALCULATE ( SUM ( 'Internet Sales'[Order Quantity] ) )
)


One cautionary note here is that the cost of such “per-row” virtual relationships can be very high. When possible, favor “batch”-oriented execution where the query uses global filters and physical relationships.
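For example, if the business rules allow one common as-of date for all customers, a batch-oriented sketch over the same AdventureWorks tables could rely on the physical relationships and global filters instead of per-row TREATAS filters (the date literal is just an illustration):

EVALUATE
CALCULATETABLE (
    ADDCOLUMNS (
        VALUES ( Customer[Customer Id] ),
        "Sales", CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ) ),
        "Quantity", CALCULATE ( SUM ( 'Internet Sales'[Order Quantity] ) )
    ),
    -- Global filters resolved through the physical relationships
    Customer[Customer Id] IN { "AW00011000", "AW00011001" },
    'Date'[Date] <= DATE ( 2007, 7, 22 )
)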

Thanks to Darren Gosbell and Marco Russo for helping with this.

DAX Calculation Groups

Just when I started thinking that there won’t be any new BI features in SQL Server 2019, Christian Wade announced DAX calculation groups in CTP 2.3. His excellent post helped me try them out with Tabular Editor and Power BI Desktop and whip out this cool report.


As Chris Webb pointed out, DAX calculation groups let us implement time calculations as in Multidimensional with a “shell” time dimension. They probably won’t help you reduce the number of measures, because you’d still need to flatten the “shell” dimension for maximum flexibility and to avoid forcing the user to use the Time Calculations slicer to select the time calculations they want on the report. So, you would still have to project Sales Amount YTD as a separate measure:

[Sales Amount YTD] = CALCULATE (SUM ( 'Reseller Sales'[Sales Amount] ), 'Time Intelligence'[Time Calculation] = "YTD")

What calculation groups do help with is avoiding copying and pasting the time-intelligence formulas all over the place. A future CTP will also help centralize measure formatting, where you’ll have the option to inherit the current measure’s format string (by leaving it blank) or override it.
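For example, the YTD item of a 'Time Intelligence' calculation group boils down to a single expression that applies to whatever measure is currently in scope. Here is a sketch following the pattern in Christian Wade's post (the Date table name is an assumption):

-- YTD calculation item: wraps the measure the visual is currently evaluating
CALCULATE (
    SELECTEDMEASURE (),
    DATESYTD ( 'Date'[Date] )
)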

To me, DAX calculation groups are a nice-to-have feature, which I could have happily traded for more significant ones. Speaking of which, I hope Microsoft adds more important enhancements to Tabular in SQL Server 2019.

Here is my top 10:

  1. Anything that can help the poor developer debug and understand in what context a DAX expression is evaluated as DAX tends to have more exceptions than rules.
  2. DAX script – to show and organize all measures in one place. I also miss the scope assignments from Multidimensional.
  3. Better Excel support
  4. Aggregations as in Power BI
  5. Composite models as in Power BI
  6. Recursive DAX measures
  7. Writeback
  8. Default members
  9. Actions (or make the JSON schema extensible so the community can fill in blanks)
  10. Full support for Tabular in Power BI, including Quick Insights, explain Increase/decrease, etc. My blog “Power BI Feature Discrepancies for Data Acquisition” has the details.

How Filters Affect DAX Measures

xVelocity (the storage engine behind Tabular, Power BI, and columnstore indexes) is an in-memory columnar database, and as such it’s not suitable for detail-level reports, such as transaction-level reports. I wrote about the performance implications of xVelocity and detail-level reporting here and here. In general, the lower the report grain and the more columns you add to the report (customer first name, last name, invoice number, etc.), the slower the performance will be, as more and more columns need to be scanned and cross-joined.

A recent Tabular project brought another twist. Most measures (about 250 in total) were produced on top of a biggish snapshot table (250 million rows) and a Customer table (12 million rows). As a best practice, base measures were created with common filters, and then other measures would piggyback on the base measures like Russian dolls. It’s a best practice because if you must change a filter, you need to change only the base measures. For example, a [Number of Open Accounts] measure would filter the snapshot to find how many open accounts the customer has. Then, a [Number of Open Accounts: Banks] measure would apply another filter:

[Number of Open Accounts: Banks] = CALCULATE ( [Number of Open Accounts], 'Creditor'[Subscriber Segment] = "Bank" )

Then, a web app lets the end user specify the customer identifiers and sends a DAX query to the model to request all 250 measures. The client complained that the query takes a long time and that there isn’t much difference in execution time between one customer and 100. To Microsoft’s credit, had the measures had the same filters, a “fusion cache” (introduced, I believe, in 2016 under “super DAX”) would kick in to reduce the number of queries to the storage engine. For example, if the query requests SalesAmount, OrderQuantity, and other measures and they all have the same filters, the fusion cache will consolidate the storage engine scans and help with the query execution. But the cache doesn’t work if measures have different filters, irrespective of whether their definitions are nested or not. Consequently, each measure sends a separate query to the storage engine. So, if the query requests 100 measures for one customer, the storage engine will be queried 100 times and the 250-million-row snapshot table will be scanned 100 times for that one customer, even if there are only a few rows for that customer. This is terribly inefficient!
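For illustration, here is a minimal sketch of the distinction (the snapshot column names are hypothetical): the first two measures share identical filters, so the fusion cache can answer both with a single scan of the snapshot table, while the third applies a different filter and therefore triggers its own scan:

-- Same filters: candidates for a single (fused) storage engine scan
[Open Balance] = CALCULATE ( SUM ( 'Account Snapshot'[Balance] ), 'Account Snapshot'[Account Status] = "Open" )

[Open Account Count] = CALCULATE ( DISTINCTCOUNT ( 'Account Snapshot'[Account ID] ), 'Account Snapshot'[Account Status] = "Open" )

-- Different filter: requires a separate scan, even though it nests a base measure
[Open Balance: Banks] = CALCULATE ( [Open Balance], 'Creditor'[Subscriber Segment] = "Bank" )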

The first thing a database developer would do to solve this in SQL is to create a temporary table that extracts the data for the selected customers before calculating the measures. But Tabular is not SQL Server. It would be great if at some point it introduced a caching mechanism that lets the developer copy a subset of rows into a cache (in this case, extract the rows from Account Snapshot for the selected customers) and then transparently redirects the snapshot queries to the cache (similar to how Power BI aggregations work today).

As things stand today, one way to reduce the query execution time is to eliminate large joins. In this case, the DAX query filters the Customer table because some measures are produced from other tables. But you can split the query into multiple queries for the different tables and then apply the customer filter directly on each table, bypassing the Customer table. So, instead of:

FILTER ( VALUES ( 'Customer'[Customer ID] ), 'Customer'[Customer ID] IN { } )

We want the filter for measures involving only the snapshot table to be:

FILTER ( VALUES ( 'Account Snapshot'[Customer ID] ), 'Account Snapshot'[Customer ID] IN { } )
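Put together, a minimal sketch of the rewritten query for snapshot-only measures might look like this (the customer identifiers are hypothetical placeholders):

EVALUATE
ADDCOLUMNS (
    FILTER (
        VALUES ( 'Account Snapshot'[Customer ID] ),
        'Account Snapshot'[Customer ID] IN { "C1001", "C1002" }
    ),
    -- The implicit CALCULATE around the measure filters the snapshot directly by Customer ID
    "Number of Open Accounts", [Number of Open Accounts]
)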

In this case, eliminating the Account Snapshot -> Customer join resulted in a 50% improvement in the query execution time.

But you should also consider using the right tool for the job at hand, and for detail-level reports this could be a relational database, which reads data by rows and not columns.

T-SQL, Power Query, or DAX?

I taught my Applied Power BI class last week to a group of smart data analysts. All of them were knowledgeable in T-SQL, which they had been using extensively for years to shape and transform data and produce SSRS, Qlik, and now Power BI reports. They hadn’t previously used Power Query, but they liked my overview of its features. However, they were rightfully confused about when to use T-SQL, Power Query, or DAX. Starting with the data source and moving up the chain, Power BI lets you use expressions in:

  1. Data source (custom SQL Query, SQL view)
  2. Power Query
  3. DAX calculated columns
  4. DAX measures (the uppermost level)

My advice is to invest time and learn all these capabilities. When you have diverse skills, you can use the best tool for the task at hand.

Although it’s hard to generalize, as every data transformation need is different, I recommend you shape the data as far upstream (closer to the data source) as possible. So, if you are familiar with T-SQL and that’s your primary data source, use T-SQL. As one of Microsoft’s best products, SQL Server has enjoyed 30 years of continuous improvements. Not only can you apply the skills you already have, but you’ll also gain performance when you use T-SQL and delegate data crunching to SQL Server, which is what it’s designed to do. You can also benefit from the rich data-manipulation features of T-SQL. These are the same reasons why I favor the ELT (Extract, Load and Transform) pattern in organizational BI solutions instead of SSIS data flow transforms. And if you find T-SQL lacking in features, Power Query can supplement it nicely, such as to fill down missing values, quickly unpivot data, or apply fuzzy lookup.

When you connect to data sources that don’t support SQL, such as flat files or Excel, the next natural place to shape and transform the data is Power Query. The choice between Power Query and DAX calculated columns is trickier and typically involves a compromise between performance and skill set. In many cases, you’ll find that a custom column can be implemented either in Power Query or as a DAX calculated column. In general, if performance is acceptable, use Power Query, which is further upstream than DAX calculated columns. As a bonus, your data model will see the custom columns as regular columns and compress them equally well. Consider DAX calculated columns (one level up from Power Query) when:

  • You must use DAX features that Power Query lacks, such as ranking.
  • Power Query transformations lead to long data refresh times, such as a lookup between two large tables.

Finally, while custom columns can often be implemented in any of the first three layers, DAX measures are unique, and they typically can be implemented only in DAX. For example, if you need an expression to reflect the end user’s filter selection, you must use a DAX measure, because only DAX measures are evaluated at run time and can access runtime conditions, such as values from filters and slicers.
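As a minimal sketch (hypothetical table and column names), the calculated column below is computed once per row during data refresh and never changes afterwards, while the measure is re-evaluated for whatever filter context the report applies:

-- Calculated column: evaluated per row at refresh time; the stored result is static
Margin = Sales[SalesAmount] - Sales[TotalProductCost]

-- Measure: evaluated at query time, so it respects slicers, filters, and cross-highlighting
[Margin %] =
DIVIDE (
    SUM ( Sales[SalesAmount] ) - SUM ( Sales[TotalProductCost] ),
    SUM ( Sales[SalesAmount] )
)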

The Cost of xVelocity Relationships

How expensive is an xVelocity relationship? The answer depends on several factors, such as column cardinality, the DAX calculations, and the query itself. But in general, a relationship can add significant overhead. Consider two tables, Customer (7.1 million rows) and Account Snapshot (187 million rows), and a relationship ‘Account Snapshot’[CustomerKey] -> Customer[CustomerKey]. I’ll use two queries for the test. The Relationship query uses the ‘Account Snapshot’[CustomerKey] -> Customer[CustomerKey] relationship, while the No Relationship query scans the Account Snapshot table directly, bypassing the relationship.

Relationship query:

EVALUATE
CALCULATETABLE (
    ADDCOLUMNS (
        FILTER (
            VALUES ( 'Customer'[CustomerKey], … ),
            'Customer'[CustomerKey] IN { 731102730822895922 }
        ),
        "Balance", [Balance]
    ),
    FILTER ( VALUES ( 'Date'[Date] ), 'Date'[Date] = DATE ( 2018, 3, 31 ) )
)

No Relationship query:

EVALUATE
CALCULATETABLE (
    ADDCOLUMNS (
        FILTER (
            VALUES ( 'Account Snapshot'[CustomerKey], … ),
            'Account Snapshot'[CustomerKey] IN { 731102730822895922 }
        ),
        "Balance", [Balance]
    ),
    FILTER ( VALUES ( 'Date'[Date] ), 'Date'[Date] = DATE ( 2018, 3, 31 ) )
)

Storage engine execution times (milliseconds):

  • 1 customer: Relationship 198 (181/17), No Relationship 141 (84/57)
  • 1,000 customers: Relationship 901 (547/354), No Relationship 1,194 (315/877)

Each of these DAX queries results in two significant queries sent to the storage engine. The first calculates the measure for each customer and the second returns the selected customers (the filter clause in the query).  In the case of the Relationship query, this is what the first SE call looks like:

SELECT
    'Customer'[CustomerKey],
    SUM ( 'Account Snapshot'[Balance Base] )
FROM 'Account Snapshot'
    LEFT OUTER JOIN 'Customer' ON 'Account Snapshot'[CustomerKey] = 'Customer'[CustomerKey]

And, of course, there is no join in the No Relationship query. The statistics below the queries show the execution times for 1 and 1,000 customers (the customer identifiers were comma-separated and added to the IN clause). The first number shows the overall execution time in the storage engine, while the numbers in parentheses show the breakdown of the two SE queries.

As we can see, for one customer the relationship adds more than twice the overhead (181 vs. 84 milliseconds). But for 1,000 customers, the overall execution time of the No Relationship query is higher (1,194 vs. 901). How come? The first storage engine query is still faster (almost 2x), but the second SE query is way slower. As it turns out, when executing the filter query, the storage engine locates 1,000 rows in the smaller Customer table much faster than in the snapshot table. However, if we add more measures, then at some point the No Relationship query becomes faster. In a real-life project where the query requested some 200 measures, the queries without relationships executed 3x faster than the ones with relationships.

xVelocity relationships between large tables could be expensive especially when the query requests many measures. If your query semantics allows it, consider denormalizing attributes from large dimension tables into fact tables to eliminate relationships.

One case where the query semantics might require a relationship is when the query requests measures from multiple fact tables. But if you find that eliminating the relationship results in a significant performance boost, consider breaking the query down to request the measures from each fact table separately and then union the results.
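As a sketch (the second fact table, its measure, and the customer keys are made up for illustration), a single DAX batch can return one rowset per fact table, each filtered directly on its own CustomerKey column, and the client application can then combine the results on CustomerKey:

-- One EVALUATE per fact table; neither query needs the join to Customer
EVALUATE
ADDCOLUMNS (
    FILTER (
        VALUES ( 'Account Snapshot'[CustomerKey] ),
        'Account Snapshot'[CustomerKey] IN { 7311027, 3082289 }
    ),
    "Balance", [Balance]
)

EVALUATE
ADDCOLUMNS (
    FILTER (
        VALUES ( 'Transactions'[CustomerKey] ),
        'Transactions'[CustomerKey] IN { 7311027, 3082289 }
    ),
    "Transaction Amount", [Transaction Amount]
)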

2-Day Applied DAX with Power BI Workshop in Atlanta (Oct 15-16)

Struggling with DAX? Data Analysis Expressions (DAX) is the expression language of Power BI, Power Pivot, and Analysis Services Tabular. It’s very powerful but it’s usually perceived as complex and requiring a steep learning curve. Taught by an established expert, this two-day workshop is designed to help you become proficient with DAX. Think of this workshop as advanced Power BI training. Reserve your seat today to attend this insightful 2-day workshop for only $999, when Teo Lachev (CEO of Prologika and Microsoft Data Platform MVP) teaches you the necessary DAX skills that you can immediately apply to your job.

Event Summary

  • Date: October 15 and 16
  • Time: 8:30 AM – 5 PM
  • Location: Microsoft Office in Alpharetta
  • Catering: lunch provided
  • Syllabus: available here
  • Price: $999 (use coupon SQL201810 for an instant 5% discount if you sign up two or more attendees from your company)
  • For more information and to register, go to http://bit.ly/daxworkshop201810

Key Benefits

  • Understand best practices for model design
  • Create advanced calculated columns
  • Master measures and evaluation contexts
  • Apply time intelligence
  • Work with advanced relationships and data security
  • Learn how to troubleshoot performance issues
  • Brainstorm your DAX puzzles
  • and much more…
