Atlanta MS BI and Power BI Group Meeting on February 4th

MS BI fans, join us for the next Atlanta MS BI and Power BI Group meeting on February 4th, Monday, at 6:30 PM at the Microsoft office in Alpharetta. The main presentation will be “Azure Data Lake for First Time Swimmers” by Samara Soucy from Innovative Architects. Julie Smith will do Prototypes with Pizza on “Python Fuzzy Wuzzy for Master Data Identity Mapping”. Innovative Architects will sponsor the event. For more details, visit our group page and don’t forget to RSVP (use the RSVP survey on the group page) if you’re planning to attend.

Presentation: Azure Data Lake for First Time Swimmers
Date: February 4, 2019, Monday
Time: 6:30 – 8:30 PM ET
Place: Microsoft Office (Alpharetta)

8000 Avalon Boulevard Suite 900
Alpharetta, GA 30009

Overview: Data Lake may be the hot new way to store data for reporting, but it can be scary if you’ve never dealt with one. If you have no idea what a data lake is, have no fear! You will get an intro to the concept of a data lake and the scenarios in which you should consider it as your platform. From there you’ll see how Azure Data Lake Analytics and U-SQL allow developers and data professionals to create compelling analytics with data that was previously difficult to work with.
Speaker: Samara is a Microsoft Certified Specialist in C#. Working in both web and data, she has experience with a wide variety of technologies, including JS, .NET, SQL, Analysis Services, and Power BI. Working with Innovative Architects as a Software Development Consultant, she uses that knowledge to plug into whatever part of their system needs additional expertise.
Sponsor: Innovative Architects specializes in solving business problems around business intelligence, data management, e-commerce, portals & collaboration, custom application engineering, and business process and integration. https://innovativearchitects.com/
Prototypes with Pizza: “Python Fuzzy Wuzzy for Master Data Identity Mapping” by Julie Smith


How Filters Affect DAX Measures

xVelocity (the storage engine behind Tabular, Power BI, and columnstore indexes) is an in-memory columnar database, and as such it’s not suitable for detail-level reports, such as transaction-level reports. I wrote about the performance implications of xVelocity and detail-level reporting here and here. In general, the lower the report grain and the more columns you add to the report (customer first name, last name, invoice number, etc.), the slower the performance will be, as more and more columns need to be scanned and cross-joined.

A recent Tabular project brought another twist. Most measures (about 250 in total) were produced on top of a biggish snapshot table (250 million rows) and a Customer table (12 million rows). As a best practice, base measures were created with common filters, and other measures would then piggyback on the base measures like Russian dolls. It’s a best practice because if you must change a filter, you need to change only the base measures. For example, a [Number of Open Accounts] measure would filter the snapshot to find how many open accounts the customer has. Then, a [Number of Open Accounts: Banks] measure would apply another filter:

CALCULATE ( [Number of Open Accounts], 'Creditor'[Subscriber Segment] = "Bank" )
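
For context, the base measure with the common filter might look something like this minimal sketch (the DISTINCTCOUNT aggregation and the [Account ID] and [Account Status] columns are assumptions for illustration, not the project’s actual definition):

-- Hypothetical base measure carrying the common filter
Number of Open Accounts :=
CALCULATE (
    DISTINCTCOUNT ( 'Account Snapshot'[Account ID] ),   -- assumed aggregation
    'Account Snapshot'[Account Status] = "Open"          -- assumed common filter
)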

Then, a web app lets the end user specify the customer identifiers and sends a DAX query to the model to request all 250 measures. The client complains that the query takes a long time and that there isn’t much difference in the query execution time for one customer or 100. To Microsoft’s credit, had the measures had the same filter, a “fusion cache” (introduced, I believe, in 2016 under “super DAX”) would kick in to reduce the queries to the storage engine. For example, if the query requests SalesAmount, OrderQuantity, and other measures that all have the same filters, this fusion cache will cache the queries to the storage engine and help with the query execution. But the cache doesn’t work if the measures have different filters, irrespective of whether their definitions are nested or not. Consequently, each measure sends a separate query to the storage engine. So, if the query requests 100 measures for one customer, the storage engine will be queried 100 times and the 250-million-row snapshot table will be scanned 100 times for that one customer, even if there are only a few transactions for that customer. This is terribly inefficient!
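
To illustrate the shape of the problem, here is a minimal sketch of the kind of query the web app might send (the customer IDs and the requested measures are illustrative):

EVALUATE
SUMMARIZECOLUMNS (
    'Customer'[Customer ID],
    -- illustrative customer filter routed through the Customer table
    FILTER ( VALUES ( 'Customer'[Customer ID] ), 'Customer'[Customer ID] IN { "C100", "C200" } ),
    "Open Accounts", [Number of Open Accounts],
    "Open Accounts: Banks", [Number of Open Accounts: Banks]
    -- ... and many more measures, each with its own filters
)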

The first thing a database developer would do to solve this in SQL is to create a temporary table that extracts the data for the selected customers before calculating the measures. But Tabular is not SQL Server. It would be great if at some point it could introduce a caching mechanism that lets the developer copy a subset of “rows” into a cache (in this case, extract the rows from the Account Snapshot table for the selected customers) and then transparently redirects the snapshot queries to the cache (similar to how Power BI aggregations work today).

As things stand today, one way to reduce the query execution time is to eliminate large joins. In this case, the DAX query filters the Customer table because some measures are produced from other tables. But you can split the query into multiple queries for the different tables and then apply the customer filter directly on each table, bypassing the Customer table. So, instead of:

FILTER ( VALUES ( 'Customer'[Customer ID] ), 'Customer'[Customer ID] IN { } )

we want the filter for measures involving only the snapshot to be:

FILTER ( VALUES ( 'Account Snapshot'[Customer ID] ), 'Account Snapshot'[Customer ID] IN { } )

In this case, eliminating the Account Snapshot -> Customer join resulted in a 50% improvement in the query execution time.
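
As a minimal sketch (the customer IDs and measures are again illustrative), the snapshot-only portion of the split query might then look like this, with the filter applied directly on the snapshot table:

EVALUATE
SUMMARIZECOLUMNS (
    'Account Snapshot'[Customer ID],
    -- customer filter applied directly on the snapshot, bypassing the Customer table
    FILTER (
        VALUES ( 'Account Snapshot'[Customer ID] ),
        'Account Snapshot'[Customer ID] IN { "C100", "C200" }
    ),
    "Open Accounts", [Number of Open Accounts],
    "Open Accounts: Banks", [Number of Open Accounts: Banks]
)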

But you should also consider using the right tool for the job at hand, and for detail-level reports this could be a relational database, which reads data by rows and not by columns.

T-SQL, Power Query, or DAX?

I taught my Applied Power BI class last week to a group of smart data analysts. All of them were knowledgeable in T-SQL, which they have been using extensively for years to shape and transform data and produce SSRS, Qlik, and now Power BI reports. They hadn’t used Power Query before, but they liked my overview of its features. However, they were rightfully confused about when to use T-SQL, Power Query, or DAX. Starting with the data source and moving up the chain, Power BI lets you use expressions in:

  1. Data source (custom SQL Query, SQL view)
  2. Power Query
  3. DAX calculated columns
  4. DAX measures (the uppermost level)

My advice is to invest time and learn all these capabilities. When you have diverse skills, you can use the best tool for the task at hand.

Although it’s hard to generalize, as every data transformation is different, I recommend you shape the data as far upstream (closer to the data source) as possible. So, if you are familiar with T-SQL and that’s your primary data source, then use T-SQL. As one of Microsoft’s best products, SQL Server has enjoyed 30 years of continuous improvements. Not only can you apply the skills you already have, but you will also gain in performance when you use T-SQL and delegate the data crunching to SQL Server, which is what it’s designed to do. You can also benefit from the rich data manipulation features of T-SQL. These are the same reasons why I favor the ELT (Extract, Load, and Transform) pattern in organizational BI solutions over SSIS data flow transforms. And if you find T-SQL lacking in features, Power Query can supplement it nicely, such as to fill down missing values, quickly unpivot data, or apply fuzzy lookups.

When you connect to data sources that don’t support SQL, such as flat files or Excel, the next natural place to shape and transform the data is Power Query. The choice between Power Query and DAX calculated columns is a tricky one and typically involves a compromise between performance and skill set. In many cases, you’ll find that a custom column can be implemented both in Power Query and as a DAX calculated column. In general, if performance is OK, use Power Query, which is further upstream than DAX calculated columns. As a bonus, your data model will see the custom columns as regular columns and compress them equally well. Consider DAX calculated columns (one level up from Power Query) when:

  • You must use DAX features that Power Query lacks, such as ranking (see the sketch after this list).
  • Power Query transformations lead to long data refresh times, such as a lookup between two large tables.
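
For example, here is a minimal sketch of a ranking calculated column that has no easy Power Query equivalent (the Customer and Sales table and column names are illustrative):

-- Hypothetical DAX calculated column on the Customer table:
-- ranks each customer by total sales across all customers
Customer Sales Rank =
RANKX (
    ALL ( 'Customer' ),
    CALCULATE ( SUM ( 'Sales'[Sales Amount] ) )
)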

Finally, while custom columns can often be implemented in any of the first three layers, DAX measures are unique and can typically be implemented only in DAX. For example, if you need the expression to reflect the end user’s filter selection, you must use a DAX measure, because only DAX measures are evaluated at run time and can access runtime conditions, such as values from filters and slicers.
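
As a minimal sketch (the table and column names are illustrative), the following measure reflects the user’s selection at run time: the numerator respects the filters coming from slicers and visuals, while the denominator removes the Customer filter:

-- Hypothetical measure: share of sales for the currently selected customers
% of All Customers :=
DIVIDE (
    SUM ( 'Sales'[Sales Amount] ),
    CALCULATE ( SUM ( 'Sales'[Sales Amount] ), ALL ( 'Customer' ) )
)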

Atlanta MS BI and Power BI Group Meeting on January 7th

MS BI fans, join us for the next Atlanta MS BI and Power BI Group meeting on January 7th, Monday, at 6:30 PM at the Microsoft office in Alpharetta. The main presentation will be “Predictive Analytics with Power BI”. I’ll walk you through the existing Power BI features for predictive analytics. Mark Tabladillo will show you exciting AI features that are coming up soon. CData will sponsor the event and give a 15-min demo of their Power BI connector. For more details, visit our group page and don’t forget to RSVP (use the RSVP survey on the main page) if you’re planning to attend.

Presentation: Predictive Analytics with Power BI
Date: January 7, 2019, Monday
Time: 6:30 – 8:30 PM ET
Place: Microsoft Office (Alpharetta)

8000 Avalon Boulevard Suite 900
Alpharetta, GA 30009

Overview: Predictive analytics, also known as data mining, machine learning, and artificial intelligence (AI), is an increasingly popular requirement. Fortunately, you don’t have to be a data scientist to benefit from machine learning in Power BI. This session is organized in two parts:

  • Part 1 – We will revisit the existing machine learning features in Power BI, including Quick Insights, Explain Increase/Decrease, linear forecasting, and integration with R and Python.
  • Part 2 – We will look at exciting new features that are coming up in the near future, such as integration with Azure ML and Cognitive Services, and automated machine learning.
Speaker: Mark Tabladillo, Ph.D., is a data scientist at Microsoft. His career has focused on industry applications of advanced analytics, using a variety of analytics tools, including SAS, SQL Server Analysis Services, Cortana Intelligence (including Microsoft R Server and Microsoft Machine Learning Services), R, and Python. He was a founding member of the Atlanta Microsoft BI User’s Group in 2010.

Through his Atlanta-based company Prologika (https://prologika.com), a Microsoft Gold Partner in Data Analytics, Teo Lachev helps organizations make sense of their most valuable asset: their data. Teo has authored and co-authored several bestselling books on organizational and self-service data analytics, and he has been leading the Atlanta Microsoft BI and Power BI group since he founded it in 2010. Teo has been a Microsoft Most Valuable Professional (MVP) for Data Platform since 2010.

Sponsor: CData Software is a leading provider of data access and connectivity solutions. We specialize in the development of drivers and data access technologies for real-time access to online or on-premises applications, databases, and Web APIs. Our drivers are universally accessible, providing access to data through established data standards and application platforms such as ODBC, JDBC, ADO.NET, OData, SSIS, BizTalk, Excel, etc.

https://cdata.com

Prototypes with Pizza: “TBD”

