DAX Calculation Groups

Just when I started thinking that there won’t be any new BI features in SQL Server 2019, Christian Wade announced DAX calculation groups in CTP 2.3. His excellent post helped me try them out with Tabular Editor and Power BI Desktop and whip out this cool report.

030319_2353_DAXCalculat1.png

As Chris Webb pointed out, DAX calculation groups let us implement time calculations as in Multidimensional with a “shell” time dimension. They probably won’t help you reducing the number of measures because you’d still need to flatten the “shell” dimension for maximum flexibility and avoid forcing the user to use the Time Calculations slicer to select the time calculations they want on the report. So, you would still have to project Sales Amount YTD as a separate measure:

[Sales Amount YTD] = CALCULATE (SUM ( 'Reseller Sales'[Sales Amount] ), 'Time Intelligence'[Time Calculation] = "YTD")

What calculation groups help with is avoiding copying and pasting the time intelligence formulas all over the place. A future CTP will also help centralize measure formatting where you’ll have the option to inherit the current measure format string (by leaving it blank), or override it.

To me,  DAX calculation groups is a nice to have feature, which I could have happily traded for more significant ones. Speaking of which, I hope Microsoft adds more important enhancements to Tabular 2019.

Here is my top 10:

  1. Anything that can help the poor developer debug and understand in what context a DAX expression is evaluated as DAX tends to have more exceptions than rules.
  2. DAX script – to show and organize all measures in one place. I miss also the MD scope assignments.
  3. Better Excel support
  4. Aggregations as in Power BI
  5. Composite models as in Power BI
  6. Recursive DAX measures
  7. Writeback
  8. Default members
  9. Actions (or make the JSON schema extensible so the community can fill in blanks)
  10. Full support for Tabular in Power BI, including Quick Insights, explain Increase/decrease, etc. My blog “Power BI Feature Discrepancies for Data Acquisition” has the details.

First Look at Power BI AutoML

In my previous “First Look at the New Power BI Integration with Azure ML” blog, I showed you how Power BI will make it very easy to integrate it with Azure Machine Learning and Cognitive Services. Data scientists today have plenty of options to create predictive models but all of them have one thing in common: expertise in data science and statistics.

The promise of AutoML is to bring predictive analytics to data analysts just like self-service BI brought analytics to business users.

Let’s see how AutoML works based on what’s in the private preview (the usual disclaimer is that things will probably change). To start with, AutoML requires a dataflow (a note to Microsoft here is that AutoML will become more pervasive if it’s available in Power BI Desktop and it doesn’t require a premium capacity). In the private preview, AutoML requires the following steps. Presumably. the first (and most difficult step), preparing the dataset and cleansing the data is already done and available as a dataflow entity:

030319_1624_FirstLookat1.png

Choosing model type

In private preview, choosing a model type requires the data analysts to select one of four model types:

  • Binary Prediction – a type of a classification task where the model predicts only two (Boolean) states, such as if a customer is a potential buyer or not. It requires a Boolean outcome field.
  • General Classification – a type of a classification task where the model predicts more states, such as if the customer’s risk falls in Low, Medium, or High risk.
  • Regression – Similar to classification, regression predicts a numeric outcome, such as the customer risk on the scale from 0 to 10.
  • Forecasting – Time series forecasting, such as to predict the revenue for future months.

I’d argue here that if want to make this super easy, we shouldn’t be asking a business user about the type of model. Power BI can simply examine the field that needs to be predicted, which happens in the next step, and suggest the model type so the current choices can be collapsed between Classification and Forecasting.

Selecting training data

Next, you need to select the field that needs to be predicted (outcome field) and input columns. This step is the most important and arguably the most difficult even for data scientists because you need to have a business domain knowledge to guess which input fields are the most significant. Modelers tend to lean on the safer side by selecting all fields which dilutes the model accuracy. Here, it will be nice if AutoML suggests which fields are the most significant. AutoML has a validation report that shows you the key influencers after the model is ready, but it would be better if it has similar feature to aid you in the field selection.

The wizard’s next step depends on the mode type. For binary classification, it asks you to specify the field names for two states, such as Buyer and NonBuyer.

Training the model

The model is ready but requires an entity refresh to train it. At this point, AutoML generates two additional entities for providing inputs for testing and training data, such as 80% of the input dataset is used to train the model (generate the predictive patterns) and 20% is used to test the model accuracy against the historical data. The assumption here is probably let the user to fine tune the training and testing data if they want to have more control. I think though that this is too technical, and it should be handled internally by AutoML to reduce the number of entities and overall complexity.

The actual model exists on the Power BI capacity. All the training happens there, and the resulting patterns are stored there. Data does not leave the capacity and the model is not published as an experiment in Azure ML.

Iterating and retraining

One thing is to prepare the model, which is easy with AutoML, but another thing is to make sure that the model is useful and results in a true gain. This is where things get technical but AutoML seeks to simplify this with a validation report. The report has three tabs:

  • Model Performance – The most important information here is the key influencer charts that show you which input fields are the top predictors. You should review this chart and possibly remove input fields that are insignificant. This page also has the only input change you can make to influence the model prediction, which is to specify the probability threshold. For example, for Binary Prediction you can change the probability threshold to 50% to indicate that the customer is considered a buyer if the predicted score is equal or above 50%.
  • Accuracy Report – The most important information on this page is the so called lift chart. You want to verify here that the solid line is above the diagonal line which represents a random guess, i.e. if we don’t use predicative analytics but just randomly guess which customers will become buyers.
  • Technical Details – This page has additional technical information which will be over the head of data analysts but could be useful to pass them on to the data scientist, such as what parameters were used for the model algorithm.

Applying the model

Now that you have the model trained, you can operationalize it by applying it to your original entity, that was used to train the model, or to another entity, such as a list of new customers, as long as it has the same fields that are used as input to the model. Applying the model to an entity adds yet another (computed) entity which has the original fields, plus the predicted. For example, the Binary Prediction model type adds two new fields:

  • Outcome – TRUE if the model predicted that the customer will be a buyer and FALSE otherwise.
  • PredictionScore – The likelihood of the prediction from 0 to 100.
  • PredictionExplanation – Additional details that include the weights of each input field to explain the predicted outcome.

So, once all is done, you’ll have 7 entities:

  • Original entity
  • AutoML entity – trains the model
  • Training data – entity used for training the model
  • Testing data – entity used for testing the model
  • AutoMLScore – a function used for scoring the input dataset
  • AutoMLPollingQuery – a helper entity which I believe returns the identifier of the model

I’d personally prefer less entities. For example, I’d rather see the predicted fields added to the entity to which the model is applied to, as opposed to having another “enriched” entity.

Power BI AutoML brings predictive analytics to data analysts. Powered by ML.NET, AutoML automates the creating of predictive models to handle common predictive tasks, such as classification and forecasting. I hope Microsoft simplifies the process by reducing the number of entities and refreshes required, and by bringing AutoML to Power BI Desktop.

First Look at the New Power BI Integration with Azure ML

Currently, Power BI doesn’t integrate natively with Azure ML so you must resort to using the Web connector to call the experiment’s web service. I demonstrate how you can do this in a singleton way (one row at a time) in chapter 11 of my latest Power BI book (you can download the code from the book page). Microsoft announced several forthcoming Machine Learning enhancements, including integration with Cognitive Services and Azure ML, and the new Auto ML features (skip to 1:08:30).

In this blog, I’ll review the Power BI integration with Azure ML, which is currently in private preview but expected to be available for public preview soon. I’ll use the Adventure Works Bike Buyer Azure ML predictive model (explained in more detail in my book). The predictive model is simple. The model is trained with data from the vTargetMail view in the AdventureWorksDW database and predicts the probability of a customer to purchase a product based on a small subset of features, such as the customer age, geographic region, house owner, number of cars owned, etc.

030219_2041_FirstLookat1.png

When you publish the web service in Azure ML, you must use the Deploy Web Service [New] Preview option. Power BI integration won’t work if you deploy it as a classic web service. There are more prerequisites required to enable integration to Azure ML and Cognitive Services, such as granting the consumers specific rights, which will be documented. Currently, this integration scenario is available in Power BI Service dataflows only. In addition, the dataflow must be created in a premium capacity and the AI Insights workload must be enabled.

Those of you who had to use the Web connector in the past to integrate Power BI with Azure ML will appreciate the simplicity of the new integration:

  1. Create a dataflow with an entity of the dataset you want to score.
  2. Click the AI Insights button in the entity toolbar. If you don’t see this button, the capacity is not a premium capacity, or the AI Insights workload is not enabled.
  3. Power BI enumerates all “new style’ web services assigned to price plans in the Power BI tenant (could be in different subscriptions) and show them in the Invoke Function window (the screenshot below shows only one web service).

  4. Once you select the Azure ML service, Power BI will ask you to map the columns in your data entity as input columns (features) to the web service.
  5. Next, Power BI adds a new “AzureML.WebServiceName”column to the entity dataset. Specifically, Power BI invokes the Azure ML web service in batch mode, which is more efficient that calling one row at the time. The underlying invocation and integration details with Azure ML are not available to you.
  6. You can expand the new column to get the scored label and probability. That’s it! That’s exactly how I imagined the Azure ML integration to work and I’m glad to see how easy it is.scoredcolumns

The new Power BI-Azure ML integration makes it very easy for a data analyst to score a dataset against a previously published (presumably by a data scientist) Azure ML experiment. In the preview, this integration requires a premium capacity and dataflow. I hope Microsoft brings this integration scenario to Power BI Desktop too.

Atlanta MS BI and Power BI Group Meeting on March 4th

MS BI fans, join us for the next Atlanta MS BI and Power BI Group meeting on March 4th, Monday, at 6:30 PM at the Microsoft office in Alpharetta. The main presentation will be “Power BI Architecture” by Paco Gonzalez  from SolidQ. Eric Flamm will do us Prototypes with Pizza on configuring a Power BI visual for mobile layout. SolidQ will sponsor the event. For more details, visit our group page and don’t forget to RSVP (use the RSVP survey on the group page) if you’re planning to attend.

Presentation:Power BI Architecture
Date:March 4, 2019, Monday
Time6:30 – 8:30 PM ET
Place:Microsoft Office (Alpharetta)

8000 Avalon Boulevard Suite 900
Alpharetta, GA 30009

Overview:Having the right architecture is key to the adoption of Power BI, and to make the most of your business intelligence solution. This session guides you through different Power BI architectures, such as Direct Query, Import, Connect Live, Composite Models, and Embedded. We will analyze different techniques in data refresh, incremental extraction, integration and collaboration to provide a quick overview of the different approaches available.
Speaker:SolidQ North America CEO and Microsoft Data Platform MVP. Paco Gonzalez has more than 15 years leading teams for high visibility solutions on Data Platform, Business Intelligence, and Machine Learning. He is also a frequent speaker at large and small conferences. SQL Saturdays, TechEd’s, PASS BA, PASS Summit,  DevWeek London, PAW Chicago and London, or Ignite.
Sponsor:The tools and technologies to manage information will constantly evolve, but the need for accurate and actionable information is timeless. SolidQ delivers services for Microsoft platforms that help you architect, integrate and optimize your use of data. The result? Our clients think bigger and move faster because we help them build the capacity and skills to interact with data in creative, collaborative ways that deliver new insights to the business. At SolidQ we measure our success by your satisfaction – and we guarantee it.
Prototypes with pizza“Move your PowerBI Viz to your Phone” by Eric Flamm

092417_1708_AtlantaMSBI1.png

2019 Gartner Magic Quadrant for Analytics and Business Intelligence Platforms

The 2019 Gartner Quadrant for Analytics and BI Platforms is out and Microsoft BI aficionados like your humble correspondent have all the reasons to rejoice. Say whatever you want about Gartner but the quadrant is a great testimonial for Power BI and I think the highest score was overdue. Microsoft still has tons of work on the visualization side of things but I’m glad that Gartner recognized the business value of Power BI which exceeds by far any other BI tool on the market. A few comments on the cautions in the report:

  • Differences in on-premises and cloud service with Azure cloud only – This is about the discrepancies between Power BI and Power BI Report Server which I discussed last year. Revisiting this, I’d like to know how many customers have complained. In my practice, I don’t see much interest in Power BI Report Server for deploying Power BI reports as everyone wants to be in the cloud.
  • Integration of Mode 1 and Mode 2 – This is the same “caution” as last year which is already covered by the first caution so I’m not sure why it’s reiterated. “With Power BI, Microsoft has mainly focused on requirements for Mode 2 (agile, self-service) analytics. On-premises SQL Server Reporting Services meets the needs of Mode 1”. I think Gartner will be happy once the SSRS integration is fully baked in Power BI Premium.
  • Multiple products – Kind of repeated from last year by phrased somewhat differently “Although the core of Power BI is a self-contained product, Microsoft’s roadmap spans multiple products.” So, the same comment. I don’t see a problem with having multiple products if they can integrate. Especially given that some are free, such as PowerApps and Flow with O365. The reason why they are separate products is that they don’t apply to BI only. And which BI vendor has flow, business apps, predictive models, in the same product? Not to mention that other vendors don’t have anything more than just a reporting tool.

Perhaps, the next most difficult thing after getting to the top is remaining on the top. Will Microsoft sustain the Power BI leader position or gradual attrition and shifting priorities will let smaller and more agile vendors dethrone it?

021419_2128_2019Gartner1.png

Atlanta MS BI and Power BI Group Meeting on February 4th

MS BI fans, join us for the next Atlanta MS BI and Power BI Group meeting on February 4th, Monday at 6:30 PM at the Microsoft office in Alpharetta. The main presentation will be “Azure Data Lake for First Time Swimmers” by Samara Soucy  from Innovative Architects. Julies Smith will do us Prototypes with Pizza on “Python Fuzzy Wuzzy for Master Data Identity Mapping”. Innovative Architects will sponsor the event. For more details, visit our group page and don’t forget to RSVP (use the RSVP survey on the group page) if you’re planning to attend.

Presentation:Azure Data Lake for First Time Swimmers
Date:February 4, 2019, Monday
Time6:30 – 8:30 PM ET
Place:Microsoft Office (Alpharetta)

8000 Avalon Boulevard Suite 900
Alpharetta, GA 30009

Overview:Data Lake may be the hot new way to store data for reporting, but it can be scary if you’ve never dealt with one. If you have no idea what a data lake is, have no fear! You will get an intro to the concept of data lake and what scenarios you should be considering it as your platform. From there you’ll get to see how Azure Data Lake Analytics and U-SQL allow developers and data professionals to create compelling analytics with data that previously was difficult to work with.
Speaker:Samara is a Microsoft Certified Specialist in C#. Working in both web and data, she has experience in a wide variety of technologies including JS, .Net, SQL, Analysis Services, and Power BI. Working with Innovative Architects as a Software Develoment Consultant, she uses that knowledge to plug into whatever part of their system needs additional expertise.
Sponsor:Innovative Architects specializes in solving business problems around business intelligence, data management, e-commerce, portals & collaboration, custom application engineering and business process and integration.  https://innovativearchitects.com/
Prototypes with pizza“Python Fuzzy Wuzzy for Master Data Identity Mapping” by Julie Smith

092417_1708_AtlantaMSBI1.png

How Filters Affect DAX Measures

xVelocity (the storage engine behind Tabular, Power BI and columnstore indexes) is an in-memory columnar database, and it’s such it’s not suitable for detail-level reports, such as transaction-level reports. I wrote about performance implications with xVelocity and detail-level reporting here and here. In general, the lower the report grain and the more columns you add to the report (customer first name, last name, invoice number, etc.), the slower the performance will be as more and more columns would need to be scanned and cross-joined.

A recent Tabular project brought another twist. Most measures (about 250 total) were produced on top of a biggish snapshot table (250 million rows) and Customer table (12 million rows). As a best practice, base measures were created with common filters and then other measures would piggy back on the base measures like Russian dolls. It’s a best practice because if you must change a filter, you need to change only the base measures. For example, a [Number of Open Accounts] measure would filter the snapshot to find how many open accounts the customer has. Then, a [Number of Open Accounts: Banks] would apply another filter:

[Number of open accounts ever] (‘Creditor'[Subscriber Segment] = “Bank”)

Then, a web app will let the end user specify the customer identifiers and send a DAX query to the model to request all the 250 measures. The client complains that the query takes a long time and there isn’t much difference in the query execution for one customer or 100. To Microsoft credit, had the measures have the same filter, a “fusion cache” (introduced I believe in 2016 under “super DAX”) would kick in to reduce queries to the storage engine. For example, if the query requests SalesAmount, OrderQuantity, and other measures and they all have the same filters, this fusion cache will cache queries to the storage engine and help with the query execution. But the cache doesn’t work if measures have different filters irrespective if their definitions are nested or not. Consequently, each measure sends a separate query to the storage engine. So, if the query requests 100 measures for one customer, the storage engine will be queried 100 times and the 250 million snapshot table will be scanned 100 times for that one customer even if there are only a few transactions for that customer. This is a terribly inefficient way!

The first thing a database developer would do to solve this straight in SQL is to create a temporary table to extract the data for the selected customers before calculating the measures. But Tabular is not SQL Server. It would be great if at some point, it could introduce a caching mechanism to let the developer copy a subset of “rows” in the cache (in this case, extract rows from the Account Snapshot for the selected customers), and then transparently redirecting the snapshot queries to the cache (like how Power BI aggregations work today).

As things stand today, one way you can reduce the query execution time is to eliminate large joins. In this case, the DAX query filters the Customer table because some measures are produced from other tables. But you can split the query into multiple queries for the different tables and then apply the customer filter directly on the table bypassing the Customer table. So, instead of:

FILTER (VALUES (Customer[Customer ID] ), ‘Customer[Customer ID] IN { }

We want to the filter for measures involving only the snapshot to be:

FILTERVALUES ( ‘Account Snapshot’[Customer ID] ), Account Snapshot’[Customer ID] IN { }

In this case, eliminating the Account Snapshot -> Customer join resulted in 50% improvement in the query execution time.

But you should also consider using the right tool for the job at hand and for detail-level reports this could be a relational database which reads data by rows and not columns.

T-SQL, Power Query, or DAX?

I taught my Applied Power BI class last week to a group of smart data analysts. All of them were knowledgeable of T-SQL, which they have been using extensively for years to shape and transform the data and produce SSRS, Qlik and now Power BI reports. They haven’t previously used Power Query, but they liked my overview of its features. However, they were rightfully confused when to use T-SQL, Power Query, or DAX. Starting with the data source and moving up the chain, Power BI lets you use expressions in:

  1. Data source (custom SQL Query, SQL view)
  2. Power Query
  3. DAX calculated columns
  4. DAX measures (the uppermost level)

My advice is to invest time and learn all these capabilities. When you have diverse skills, you can use the best tool for the task at hand.

Although is hard to generalize as every data transformation is different, I recommend you shape the data as upstream (closer to the data source) as possible. So, if you are familiar with T-SQL and that’s your primary data source, then use T-SQL. As one of the Microsoft best products, SQL Server has been enjoying 30 years of continuous improvements. Not only can you apply the skills you already have, but you will gain in performance when you use T-SQL and delegate data crunching to SQL Server which is what it’s designed to do. You can also benefit from the rich data manipulation features of T-SQL. These are the same reasons why I favor the ELT (Extract, Load and Transform) pattern in organizational BI solutions instead of SSIS data flow transforms. And if you find T-SQL lacking in features, Power Query can supplement it nicely, such as to fill down missing values, quickly unpivot data, or apply fuzzy lookup.

When you connect to data sources that don’t support SQL, such as flat files or Excel, the next natural place is Power Query to shape and transform the data. The choice between Power Query and DAX calculated columns is a tricky one and typically involves a compromise between performance and skill set. In many cases, you’ll find that a custom column can be implemented both in Power Query and DAX calculated columns. In general, if performance is OK, use Power Query, which is further upstream than DAX calculated columns. As a bonus, your data model will see the custom columns as regular columns and compress them equally well. Consider DAX calculated columns (one level up Power Query) when:

  • You must use DAX features that Power Query lacks, such as ranking.
  • When Power Query transforms lead to long data refresh times, such as a lookup between two large tables.

Finally, while custom columns can be often implemented in any of the first three layers, DAX measures are unique, and they typically can be implemented in DAX only. For example, if you need the expression to reflect the end user filter selection, you must use a DAX measure because only DAX measures are evaluated at run time and can access runtime conditions, such as values from filters and slicers.

Atlanta MS BI and Power BI Group Meeting on January 7th

MS BI fans, join us for the next Atlanta MS BI and Power BI Group meeting on January 7th, Monday at 6:30 PM at the Microsoft office in Alpharetta. The main presentation will be “Predictive Analytics with Power BI”. I’ll walk you through the existing Power BI features for predictive analytics. Mark Tabladillo will show you exiting AI features that are coming up soon. CDATA will sponsor the event and do us a 15-min demo of their Power BI connector. For more details, visit our group page and don’t forget to RSVP (use the RSVP survey on the main page) if you’re planning to attend.

Presentation:Predictive Analytics with Power BI
Date:January 7, 2019, Monday
Time6:30 – 8:30 PM ET
Place:Microsoft Office (Alpharetta)

8000 Avalon Boulevard Suite 900
Alpharetta, GA 30009

Overview:Predictive analytics, also known as data mining, machine learning, and artificial intelligence (AI), is an increasingly popular requirement. Fortunately, you don’t have to be a data scientist to benefit from machine learning in Power BI. This session is organized in two parts:

  • Part 1 – We will revisit the existing machine learning features in Power BI, including Quick Insights, Explain Increase/Decrease, linear forecasting, integration with R and Python.
  • Part 2 – We will look at exciting new features that are coming up in near future, such as integration with Azure ML and Cognitive Services, and automated machine learning.
Speaker:Mark Tabladillo Ph.D. is a data scientist at Microsoft. His career has focused on industry application of advanced analytics, using a variety of analytics tools including SAS, SQL Server Analysis Services, Cortana Intelligence (including Microsoft R Server and Microsoft Machine Learning Services), R, and Python. He was a founding member of the Atlanta Microsoft BI User’s Group in 2010.

Through his Atlanta-based company Prologika (https://prologika.com), a Microsoft Gold Partner in Data Analytics, Teo Lachev helps organizations make sense of their most valuable asset: their data. Teo has authored and co-authored several bestselling books on organizational and self-service data analytics, and he has been leading the Atlanta Microsoft BI and Power BI group since he founded it in 2010. Teo has been a Microsoft Most Valued Professional (MVP) Data Platform since 2010.

Sponsor:CData Software is a leading provider of data access and connectivity solutions. We specialize in the development of Drivers and data access technologies for real-time access to on-line or on-premise applications, databases, and Web APIs. Our drivers are universally accessible, providing access to data through established data standards and application platforms such as ODBC, JDBC, ADO.NET, OData, SSIS, BizTalk, Excel, etc.

https://cdata.com

Prototypes with Pizza“TBD”


092417_1708_AtlantaMSBI1.png

Applied Power BI Book (4th Edition)

I’m excited to announce the fourth edition of my Applied Microsoft Power BI book! When the original book was published  in January 2016, it was the first Power BI book at that time and it had less than 300 pages.  Since then, I helped many companies adopt or transition to Power BI and taught hundreds of students. It’s been a great experience to witness the momentum surrounding Power BI and how the tool matured over time. As a result, the book got thicker too and it now stands at 500 pages. However, I believe what’s more important is that this book provides systematic, yet dependent, view by showing what Power BI can do for four types of users (business users, analysts, pros, and developers).

This is the only Power BI book that gets annual revisions to keep it up to date with this ever changing technology! Because I had to draw a line somewhere, Applied Microsoft Power BI (4th edition) covers all features that are in preview or released by December 2018. As with my previous books, I’m committed to help my readers with book-related questions and welcome all feedback on the book discussion forum on the book page. While you are there, feel free to check out the book resources (sample chapter, front matter, and more). I also encourage you to follow my blog at https://prologika.com/blog and subscribing to my newsletter at https://prologika.com to stay on the Power BI latest.

Bring your data to life for the forth time! Keep on reading, learning, and Happy New Year!

apbi4