Atlanta MS BI and Power BI Group Meeting on July 11th (Pushing the Query Folding limits with Power Query)

Please join us online for the next Atlanta MS BI and Power BI Group meeting on Monday, July 11th, at 6:30 PM ET.  For more details and sign up, visit our group page.

Presentation:Pushing the Query Folding limits with Power Query
Date:July 11th
Time:6:30 – 8:30 PM ET
Place:Click here to join the meeting
Overview:One of Power Query’s most powerful features is its ability to translate the Power Query formula language (M) back to a source systems native language and in this session, we’ll push the limits and possibilities to avoid “breaking the fold” and explore some potential dark magic with List functions. A base understanding of T-SQL is helpful though not required for this session.
Speaker:From financial services to felines, the World Wide Web to professional wrestling – Alex Powers has an affinity for the conventional and unconventional when it comes to information. A self-proclaimed Excel and Power BI Enthusiast Alex Powers enjoys contributing to online forums and sharing his passion for empowering others using Microsoft technologies.
Prototypes without PizzaPower BI Latest

PowerBILogo

Power BI Field Parameters

Coming back from a long vacation and I almost missed this new Power BI killer feature: Field Parameters! Not to be confused with Dynamic M Query Parameters that I ranted about here, field parameters solve a long-standing limitation of Power BI that prevents binding dynamically fields to a visual. Dynamic binding isn’t an issue with measures because they are dynamic and can evaluate runtime conditions, such as slicer selection, but dimensions are a different story. Once they are bound to a category bucket in a visual, you couldn’t change them on the fly.

Yet, one common scenario was to let the user control which fields will be used for slicing the measure(s) in a visual. I’ve seen rather convoluted implementations to get around this limitation. Field parameters to the rescue. Now once you create a field parameter and bind it to the visual, the user can simply select which field will be used for slicing.

Field parameters open the opportunity for packing more visuals on a single page and letting the user specify what they want to see in these visuals! Moreover, the fields can come from different tables. On the downside, one significant limitation not mentioned in the documentation, is that currently visuals can’t sort on the field parameter and no workaround exists (see this GitHub issue for details).

BTW, you can use this DAX measure to get the user-friendly selected value assuming you accepted the default name for the field parameter: ParameterTitle = MAX(Parameter[Parameter]). Or, because the parameter uses a groupby set, you can use this expression (thanks Alberto Ferrari):

ParameterTitle =
VAR _a = SUMMARIZE ( Parameter, Parameter[Parameter], Parameter[Parameter Fields] )
VAR _b = SELECTCOLUMNS ( _a, "Parameter", Parameter[Parameter] )
VAR _result = IF ( COUNTROWS ( _b ) = 1, _b )
RETURN
    _result

Power BI Datamarts: the Good, the Bad, and the Ugly

As Microsoft announced here, Power BI datamarts are upon us. I can almost see an important enterprise client demanding “self-service datamarts me now or else… “, thus inspiring an opportunity for another premium feature, spearheaded with great vision and effort, but questionable practical value. In a nutshell, a Power BI datamart is a combo of Power BI Premium and a Microsoft-hosted Azure SQL Database aiming to simplify the implementation of a departmental datamart.

The Good

Unlike other vendors, such as Domo and their proprietary and overly expensive stack, Microsoft has decided to go with somewhat open solution consisting of tools that Power BI users already know: Power Query, Power BI Desktop (for the first time some of its modeling features, such as relationships and DAX measures, made it to the cloud), and SQL Server. Microsoft provisions the database for you although surrounds it with some red tape (more on this in a moment). Thus, a business users aiming for “no code, low code” experience will whip out some dataflows that populate the database and then build a model (dataset) directly in Power BI Service. Obviously, the main goal is to simplify the experience as much as possible where all the action happens online.

It’s nice that Microsoft chose hosting the data in a SQL database instead of a “lakehouse”. Apparently, they learned some painful lessons from Power Query CDM folders. The database size is up to 100 GB which is not bad at all.

The Bad

From the announcement, “Best of all, IT doesn’t have to worry about getting all data into centrally governed data sources, thus providing discipline at the core and flexibility at the edge.” I failed to see how this will provide “discipline at the core” – a tenant that Microsoft learned from their own pain points after tilting too much toward self-service BI. I’ve seen also statements online that business users don’t have to “consult with IT anymore” when implementing datamarts. Really? What happened to managed self-service BI? I’m sure IT will be thrilled having corporate data in Microsoft-owned databases that they can’t manage and queries running amuck and consuming precious premium resources. Luckily, the admin portal has a switch to control who can create these datamarts. I hope at least we have a BYO database feature at some point.

The elastic Azure SQL database that Microsoft provisions is read-only, meaning that you can’t create objects. I’m a big fan of pushing calculations as much as possible to SQL Server, such as by implementing SQL views, but we can’t do that. Instead, we would use Power Query (what else of course) for all the transforms. But I have serious reservations against Power Query – a tool that is known to cause performance issues without providing any troubleshooting and maintenance insights.

The Ugly

Do we really need this feature? I would argue that what was really needed was extending Power Query with “destinations” where the user can specify where the data would land. If that was implemented, IT could selectively let business users augment the infrastructure set up by IT with self-service ETL (more than likely temporary) that sinks the data into an IT-sanctioned database. Further, it would have gotten us out of another proprietary mess that forces dataflows to save their output into CDM folders that make sense only to Microsoft (see my “Power BI Dataflows vs ADF Mapping Data Flows” blog for the gory details). Want to save dataflow data somewhere else? You got to use Power BI datamarts because this is the only way you can have your data in a (Microsoft) relational database and nowhere else.

Recently, an enterprise client has decided to migrate all self-service Alteryx flows to IT-governed ADF pipelines. More than likely, Power BI datamarts are heading in that direction. Be very careful about any pure self-service features, as you might find yourself in a bigger mess that you tried to solve.

A Case for Azure Analysis Services

Microsoft BI practitioners have three options for hosting semantic models: SSAS (on prem), Azure Analysis Services (cloud), and Power BI (cloud). AAS is somewhat caught between a rock and a hard place. Given that Power BI gets the most attention for cloud deployment, why would you consider AAS at all? There are two main reasons:

  1. Cost – Organizational semantic models might require a lot of memory and crunching power. Hosting them on AAS might be more cost effective. For example, AAS S4 runs at around $5,000 which at the same price point as Power BI Premium P1. However, it gives you 100 GB of RAM and 20 cores, whereas P1 has only 25 GB and 8 cores.
  2. Scaling out – A feature unique to AAS is ability to scale out to multiple query replicas. This is not an option with Power BI Premium, and it requires quite a bit of setup with SSAS. However, AAS makes scaling out easy by just changing a slider. And once you’re done, you can pause the instance, so it doesn’t incur cost!

Scaling out proved to be a useful feature lately when a client wanted to process massive queries in parallel. We cloned the model to AAS and wrote an ETL job to parallelize the query execution.

Note that the number of replicas depends on the data region and pricing level. For example, only East US 2 and West US support up to 7 query replicas up to S4. Another thing to watch for is that it’s not enough to just process the model on a scale-out farm. You’d need also to synchronize it across the query replicas. This could be done manually in the Azure Portal or automated, such by using the PowerShell script below that you can plug in a SQL Agent job. The script uses a regular AAD account which has admin rights to the server. You can also use a service principal, but I opted for a regular account because Microsoft removed the option for no expiration date for the client secret (the maximum lifetime of a client secret now is two years).

Import-Module Az.AnalysisServices
$password = "<account password>" | ConvertTo-SecureString -asPlainText -Force
$username = "craas@<domain>.com"
$aasendpoint = "asazure://aspaaseastus2.asazure.windows.net/crliveaas1"
$aasendpointmgmt = "asazure://aspaaseastus2.asazure.windows.net/crliveaas1:rw"
$TenantId = "<tenant id>"
$credential = New-Object System.Management.Automation.PSCredential($username,$password)
$defaultProfile = Connect-AzAccount -Credential $credential -Tenant $TenantId

Set-AzContext -Tenant $TenantId -DefaultProfile $defaultProfile
$server = Get-AzAnalysisServicesServer -ResourceGroupName "crliveaas_rg" -Name "crliveaas1" -DefaultProfile $defaultProfile
if ($server.State -eq "Paused")
{
    Resume-AzAnalysisServicesServer -Name "crliveaas1" -ResourceGroupName "crliveaas_rg"  
    #process database, first clear the data so processing doesn't go over memory limit
    Invoke-ProcessASDatabase -Server $aasendpointmgmt -DatabaseName "<databasename>" -RefreshType "ClearValues" -Credential $credential
    Invoke-ProcessASDatabase -Server $aasendpointmgmt -DatabaseName "<databasename>" -RefreshType "Full" -Credential $credential

    # sync database
   Add-AzAnalysisServicesAccount  -Credential:$credential -RolloutEnvironment:"aspaaseastus2.asazure.windows.net"
   Sync-AzAnalysisServicesInstance -Instance $aasendpointmgmt -Database "<databasename>" -PassThru
}

Atlanta MS BI and Power BI Group Meeting on June 6th (How Power Query Thinks)

Please join us online for the next Atlanta MS BI and Power BI Group meeting on Monday, June 6th, at 6:30 PM ET.  For more details and sign up, visit our group page.

Presentation:How Power Query Thinks: Taking the Mystery Out of Streaming and Query Folding
Date:June 6th
Time:6:30 – 8:30 PM ET
Place:Click here to join the meeting
Overview:How does Power Query produce the table data your expressions ask it to output? Query folding is one key concept—where behind the scenes, part or all of your M code may be rewritten into the data source’s native query language then offloaded to the source for execution. Streaming, though perhaps a less familiar term, is even more fundamental, as it describes how table data (as well as list and binary data) flows between functions in M.

 

Understand these concepts, and you’ll be better positioned to write more efficient mashups, debug problems and avoid unexpected variability in results. Join this session to learn about these key concepts—in a nutshell, to learn about How Power Query Thinks when you ask it to produce table data!

 

Speaker:Author of the Power Query M Primer Series, an in-depth dive into the Power Query language, Ben Gribaudo is a seasoned architect, developer and data engineer.
Prototypes without PizzaPower BI Latest

PowerBILogo

Power BI Dataflows vs ADF Mapping Data Flows

An enterprise client wants to migrate many Alteryx workflows created over years by smart business users to the Microsoft ecosystem. During the initial intake, we discussed Power BI dataflows vs Azure Data Factory mapping data flows. Yep, Microsoft loves to confuse us, but these technologies have nothing to do with each other.

Power BI DataflowsADF mapping data flows
Target audienceBusiness usersProfessionals
AvailabilityPower BI ServiceAzure Data Factory and Synapse Studio
Underlying technologyPower QuerySpark
Expression languageM LanguageExpression functions evaluated to Spark data types
Computation EnginePower BI Compute EngineSpark clusters
OutputAzure Data Lake Storage (CSV files in CDM folders)Many sinks
DebuggingN/ADebug mode
MonitoringVery basic (refresh failures)Detail output

In a nutshell, Power BI dataflows are meant for self-service ETL, where the business users would be responsible for creating and managing the ETL flows. By contrast, ADF mapping data flows target BI Pros. If you have been following my blog, you know that I’m big proponent for the ELT pattern for various reasons, mainly better performance and avoiding tool dependencies. Despite how much Microsoft promotes ADF mapping data flows, my typical data integration projects don’t use them. The last time I looked at them, they didn’t support ADF self-hosted runtimes, and they are a pain to debug. But if you must do transformations on the fly, e.g. when dumping data into files, then you obviously don’t have a choice.

So, what did the client decide to do? IT decided to take over the Alteryx workflows and convert them to Azure Data Factory. So much about self-service ETL.

FastTrack Recognized Solution Architect 2022

Microsoft re-awarded me as a FastTrack Recognized Solution Architect – Power BI for 2022! This prestigious recognition is conferred by the Power Platform product engineering team for consistently exhibiting deep architecture expertise and creating high quality solutions for customers during project engagements. I’m one of the 36 individuals worldwide who met the following criteria:

  • Must have a minimum of 2 years of experience with Power BI and a minimum of 5 years of experience with Enterprise BI solutions
  • Must have a minimum of 2 years of experience as an Enterprise BI architect
  • Must be working for a partner with Gold certification in Data Analytics MPN competency
  • Must have been lead architect for at least 2 Power BI in-production implementations with at least 200 active users (Preferably for CAT managed customers)

Atlanta MS BI and Power BI Group Meeting on April 4th

Please join us online for the next Atlanta MS BI and Power BI Group meeting on Monday, April 4th, at 6:30 PM ET.  Michael Carlo (Power BI MVP) will show how to speed up your data modeling experience by using DAX templates. For more details and sign up, visit our group page.

Presentation:World’s Fastest DAX – Using Quick DAX Templates
Date:April 4th
Time:6:30 – 8:30 PM ET
Place:Click here to join the meeting
Overview:DAX is easy to Learn but hard to master.

This session is all about using the Growing Community Driven Library of Quick DAX Templates. Learn how to find the templates and use them in your daily workflow to drastically speed up your Data Modeling Experience.

Session Covers the following topics:

1. Introduces the concept of Templatized DAX

2. Consuming a Quick DAX Template using Power BI External Tools

3. Discusses how to create your own templates and use them

Multiple demos of working directly inside the tooling for consuming and creating Quick DAX templates.

Speaker:Michael Carlo is very passionate about data and analytics. He spent a massive amount of time learning Power BI. Through this learning process he felt that others may also be interested. Thus, he created a website, a knowledge repository for all things PowerBI (powerbi.tips). This site is uniquely set up for people to read and learn about data modeling and visualizations within Power BI.
Prototypes without PizzaPower BI Latest

PowerBILogo

5 Storage Gotchas for Power BI Dataflows

Over the past few years, the BI industry has come up with new file formats, such as Parquet, ORC, and Avro, which are widely used today. To facilitate its vision for cross-industry data integration, Microsoft introduced a few years ago the Common Data Model (CDM) and CDM Folders. Power BI dataflows output CSV files to CDM folders and each table is saved in its own folder. You can bring your own data lake to directly access these files. If do so, you’ll find the following folder structure:

Although accessing the dataflow files might open all sorts of data integration scenarios, here are some things to watch for concerning the dataflow output:

  1. If you plan to migrate from other self-service ETL tools, such as Alteryx, note that a Power BI dataflow can output only to a CDM folder and export the data as CSV.
  2. Each time the dataflow refreshes, a new snapshot file is generated and added to the <table>.csv.snapshots folder. Currently, dataflows don’t delete previous snapshots and there is no retention policy. You could consider this a feature that lets traverse the dataflow run history, but very quickly you might end up with lots of files.
  3. Working the snapshot files is cumbersome. For example, if you plan to load the data directly from the CSV files, you’d typically want to access the latest data and you’d want the file name to stay the same. However, as you can see in the screenshot, the file name includes the timestamp. So, if the tool doesn’t have a CDM connector, it must sort the files in the folder and load from the top file.
  4. The metadata (column names and data types) is stored in the model.json file. Continuing the previous example, you won’t get the column headers and types if you just load the snapshot file.
  5. Very few tools today support CDM folders. To support them, the tool must first query the model.json file to determine the location of the latest snapshot. The tool must then apply the data types from model.json. The Azure Data Lake Gen 2 connector in Power BI support them, but it’s been in a perpetual Beta. Azure Data Factory supports CDM, but it requires mapping data flows that I typically try to avoid. Microsoft has a Databricks package that understand CDM. I see that Informatica has put up a connector. This is a timid response from the industry considering the Microsoft’s ambitious vision. Even Microsoft’s own Synapse Serverless doesn’t support them yet. Why didn’t Microsoft decide to use modern and established file formats, such as Parquet, that can save both the metadata and data in the same file? It’s clear that Microsoft opted for the lowest denominator that every tool supports, which is CSV. But because CSV files don’t include the metadata, Microsoft had to find a way to provide it.

As a workaround for these limitations, consider implementing Power BI datasets that wrap the dataflows using the Dataflow connector. If you use Power BI Premium or PPU, enable DirectQuery in the enhanced compute engine, so that you don’t have to import the data (the only data connectivity supported by the ADLS connector is import) and you don’t have to refresh both the dataset and dataflow. Looking forward, I’d like to see dataflows supporting output settings, such as retention, immutable file name for the latest snapshot. Dataflows should add an output connector to define where the data should be sent, such as to output to Parquet files to store both the schema and metadata in the file or output to a relational database. I’d like also to see Synapse Serverless extended to support CDM folders.

In summary, yes, you can directly access the dataflow raw files in your own lake, but as it stands, the CDM folder implementation limits your data integration options.

Referencing Columns in DAX Table Variables

Suppose you use a DAX table variable, such as to group by certain columns and add an extension column as a calculation. Then, you want to count the rows in the table by filtering on one of the columns. At your first attempt, you might try using CALCULATE.

VAR _t = ADDCOLUMNS(SUMMARIZE(…), "SomeColumn", <some expression>)
RETURN
CALCULATE(COUNTROWS(_t), _t[SomeColumn] = something)

You’ll get an error that the column you reference cannot be found. To get this to work, you must use FILTER (notice that code uses [SomeColumn] instead of the column fully qualified name (_t[SomeColumn).

VAR _t = ADDCOLUMNS(SUMMARIZE(…), "SomeColumn", <some expression>)
RETURN
COUNTROWS(FILTER(_t, [SomeColumn] = something)

Why doesn’t CALCULATE work? CALCULATE changes the filter context, but DAX filter context only allows columns in the model. It doesn’t allow “extension” columns created in a DAX expression. [SomeColumn] does not exist in the model but only in the DAX expression and therefore is not visible by CALCULATE.

On the other hand, FILTER resolves [SomeColumn] in row context. [SomeColumn] is bound to the column in table expression _t and it can be resolved.

Speaking of grouping, you might have noticed that Power BI Desktop uses SUMMARIZECOLUMNS in autogenerated DAX queries instead of ADDCOLUMNS(SUMMARIZE()). However, if you attempt to use SUMMARIZECOLUMNS in a measure, you’ll get an error “SummarizeColumns() and AddMissingItems() may not be used in this context”. How come?

SUMMARIZECOLUMNS is not supported in measures because it was designed for resultset-producing, autogenerated DAX queries. It wasn’t enabled for measures because it has different semantics when filters exist in the filter context, as in the measure scenario, and when filters are passed to it as arguments, as in the DAX query scenario. If you want to use SUMMARIZECOLUMNS in a measure to replicate the query results of a visual, you may not get the expected results due to the different ways the function treats filters from different sources and so it can’t be used in measures.