Atlanta MS BI and Power BI Group Meeting on June 6th (How Power Query Thinks)

Please join us online for the next Atlanta MS BI and Power BI Group meeting on Monday, June 6th, at 6:30 PM ET. For more details and to sign up, visit our group page.

Presentation: How Power Query Thinks: Taking the Mystery Out of Streaming and Query Folding
Date: June 6th
Time: 6:30 – 8:30 PM ET
Place: Click here to join the meeting
Overview: How does Power Query produce the table data your expressions ask it to output? Query folding is one key concept: behind the scenes, part or all of your M code may be rewritten into the data source's native query language and then offloaded to the source for execution. Streaming, though perhaps a less familiar term, is even more fundamental, as it describes how table data (as well as list and binary data) flows between functions in M.


Understand these concepts, and you’ll be better positioned to write more efficient mashups, debug problems and avoid unexpected variability in results. Join this session to learn about these key concepts—in a nutshell, to learn about How Power Query Thinks when you ask it to produce table data!
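
As a quick illustration of query folding (my own sketch, not part of the session, and the server, database, and table names below are hypothetical), the filter step in the following M query is translated into a WHERE clause and executed by SQL Server rather than by Power Query:

let
    Source = Sql.Database("myserver", "AdventureWorksDW"),
    Sales = Source{[Schema = "dbo", Item = "FactInternetSales"]}[Data],
    // This step folds: Power Query rewrites it as ... WHERE SalesAmount > 1000 and lets the source run it
    Filtered = Table.SelectRows(Sales, each [SalesAmount] > 1000)
in
    Filtered

You can check whether a step folded by right-clicking it in the Power Query Editor and looking for View Native Query.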


Speaker: Ben Gribaudo is a seasoned architect, developer, and data engineer, and the author of the Power Query M Primer Series, an in-depth dive into the Power Query language.


Power BI Dataflows vs ADF Mapping Data Flows

An enterprise client wants to migrate many Alteryx workflows created over years by smart business users to the Microsoft ecosystem. During the initial intake, we discussed Power BI dataflows vs Azure Data Factory mapping data flows. Yep, Microsoft loves to confuse us, but these technologies have nothing to do with each other.

                        Power BI Dataflows                                    ADF mapping data flows
Target audience         Business users                                        Professionals
Availability            Power BI Service                                      Azure Data Factory and Synapse Studio
Underlying technology   Power Query                                           Spark
Expression language     M Language                                            Expression functions evaluated to Spark data types
Computation Engine      Power BI Compute Engine                               Spark clusters
Output                  Azure Data Lake Storage (CSV files in CDM folders)    Many sinks
Debugging               N/A                                                   Debug mode
Monitoring              Very basic (refresh failures)                         Detailed output

In a nutshell, Power BI dataflows are meant for self-service ETL, where business users are responsible for creating and managing the ETL flows. By contrast, ADF mapping data flows target BI pros. If you have been following my blog, you know that I'm a big proponent of the ELT pattern for various reasons, mainly better performance and avoiding tool dependencies. Despite how much Microsoft promotes ADF mapping data flows, my typical data integration projects don't use them. The last time I looked at them, they didn't support the ADF self-hosted integration runtime, and they are a pain to debug. But if you must do transformations on the fly, e.g. when dumping data into files, then you obviously don't have a choice.

So, what did the client decide to do? IT decided to take over the Alteryx workflows and convert them to Azure Data Factory. So much for self-service ETL.

FastTrack Recognized Solution Architect 2022

Microsoft re-awarded me as a FastTrack Recognized Solution Architect – Power BI for 2022! This prestigious recognition is conferred by the Power Platform product engineering team for consistently exhibiting deep architecture expertise and creating high quality solutions for customers during project engagements. I’m one of the 36 individuals worldwide who met the following criteria:

  • Must have a minimum of 2 years of experience with Power BI and a minimum of 5 years of experience with Enterprise BI solutions
  • Must have a minimum of 2 years of experience as an Enterprise BI architect
  • Must be working for a partner with Gold certification in Data Analytics MPN competency
  • Must have been lead architect for at least 2 Power BI in-production implementations with at least 200 active users (Preferably for CAT managed customers)

Atlanta MS BI and Power BI Group Meeting on April 4th

Please join us online for the next Atlanta MS BI and Power BI Group meeting on Monday, April 4th, at 6:30 PM ET. Michael Carlo (Power BI MVP) will show how to speed up your data modeling experience by using DAX templates. For more details and to sign up, visit our group page.

Presentation: World's Fastest DAX – Using Quick DAX Templates
Date: April 4th
Time: 6:30 – 8:30 PM ET
Place: Click here to join the meeting
Overview: DAX is easy to learn but hard to master.

This session is all about using the growing, community-driven library of Quick DAX templates. Learn how to find the templates and use them in your daily workflow to drastically speed up your data modeling experience.

The session covers the following topics:

1. The concept of templatized DAX

2. Consuming a Quick DAX template using Power BI External Tools

3. Creating your own templates and using them

The session includes multiple demos working directly inside the tooling for consuming and creating Quick DAX templates.

Speaker: Michael Carlo is very passionate about data and analytics. He spent a massive amount of time learning Power BI and, through this learning process, felt that others might also be interested. Thus, he created a website, a knowledge repository for all things Power BI (powerbi.tips). The site is uniquely set up for people to read and learn about data modeling and visualizations within Power BI.


5 Storage Gotchas for Power BI Dataflows

Over the past few years, the BI industry has come up with new file formats, such as Parquet, ORC, and Avro, which are widely used today. To facilitate its vision for cross-industry data integration, a few years ago Microsoft introduced the Common Data Model (CDM) and CDM folders. Power BI dataflows output CSV files to CDM folders, and each table is saved in its own folder. You can bring your own data lake to access these files directly. If you do so, you'll find the following folder structure:

Although accessing the dataflow files might open all sorts of data integration scenarios, here are some things to watch for concerning the dataflow output:

  1. If you plan to migrate from other self-service ETL tools, such as Alteryx, note that a Power BI dataflow can output only to a CDM folder and export the data as CSV.
  2. Each time the dataflow refreshes, a new snapshot file is generated and added to the <table>.csv.snapshots folder. Currently, dataflows don't delete previous snapshots and there is no retention policy. You could consider this a feature that lets you traverse the dataflow run history, but very quickly you might end up with lots of files.
  3. Working with the snapshot files is cumbersome. For example, if you plan to load the data directly from the CSV files, you'd typically want to access the latest data and you'd want the file name to stay the same. However, as you can see in the screenshot, the file name includes the timestamp. So, if the tool doesn't have a CDM connector, it must sort the files in the folder and load from the top file (see the sketch after this list).
  4. The metadata (column names and data types) is stored in the model.json file. Continuing the previous example, you won’t get the column headers and types if you just load the snapshot file.
  5. Very few tools today support CDM folders. To support them, a tool must first query the model.json file to determine the location of the latest snapshot and then apply the data types from model.json. The Azure Data Lake Storage Gen2 connector in Power BI supports them, but it's been in perpetual beta. Azure Data Factory supports CDM, but it requires mapping data flows, which I typically try to avoid. Microsoft has a Databricks package that understands CDM, and I see that Informatica has put up a connector. This is a timid response from the industry considering Microsoft's ambitious vision. Even Microsoft's own Synapse Serverless doesn't support them yet. Why didn't Microsoft use a modern, established file format, such as Parquet, that can save both the metadata and data in the same file? It's clear that Microsoft opted for the lowest common denominator that every tool supports, which is CSV. But because CSV files don't include the metadata, Microsoft had to find a way to provide it.
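
To illustrate the workaround from point 3, here is a minimal Power Query sketch that picks the latest snapshot for a table from a bring-your-own Azure Data Lake Storage Gen2 account. The storage account, workspace, dataflow, and table names are hypothetical, and the column names and data types would still have to come from model.json:

let
    Source = AzureStorage.DataLake("https://mydatalake.dfs.core.windows.net/powerbi/MyWorkspace/MyDataflow/Customer.csv.snapshots"),
    // Snapshot file names embed a timestamp, so sorting by name descending puts the newest file first
    // (assuming the timestamp format sorts lexicographically)
    Sorted = Table.Sort(Source, {{"Name", Order.Descending}}),
    LatestFile = Sorted{0}[Content],
    // The snapshot CSV has no header row; apply names and types from model.json afterwards
    Result = Csv.Document(LatestFile, [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.Csv])
in
    Result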

As a workaround for these limitations, consider implementing Power BI datasets that wrap the dataflows using the Dataflow connector. If you use Power BI Premium or PPU, enable DirectQuery in the enhanced compute engine, so that you don't have to import the data (the only data connectivity supported by the ADLS connector is import) and you don't have to refresh both the dataset and the dataflow. Looking forward, I'd like to see dataflows support output settings, such as a retention policy and an immutable file name for the latest snapshot. Dataflows should also add an output connector to define where the data should be sent, such as Parquet files (to store both the metadata and data in the same file) or a relational database. I'd also like to see Synapse Serverless extended to support CDM folders.
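
For reference, a dataset that wraps a dataflow entity typically navigates to it along the following lines. This is only a sketch: the identifiers and the entity name are placeholders, and the exact navigation steps are the ones Power BI Desktop generates for you when you connect with the Dataflow connector, so it's best to let the connector write them:

let
    Source = PowerBI.Dataflows(null),
    // Navigate workspace -> dataflow -> entity; the GUIDs below are placeholders
    Workspace = Source{[workspaceId = "00000000-0000-0000-0000-000000000001"]}[Data],
    Dataflow = Workspace{[dataflowId = "00000000-0000-0000-0000-000000000002"]}[Data],
    Customer = Dataflow{[entity = "Customer", version = ""]}[Data]
in
    Customer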

In summary, yes, you can directly access the dataflow raw files in your own lake, but as it stands, the CDM folder implementation limits your data integration options.

Referencing Columns in DAX Table Variables

Suppose you use a DAX table variable, such as to group by certain columns and add an extension column as a calculation. Then, you want to count the rows in the table by filtering on one of the columns. At your first attempt, you might try using CALCULATE.

VAR _t = ADDCOLUMNS(SUMMARIZE(…), "SomeColumn", <some expression>)
RETURN
CALCULATE(COUNTROWS(_t), _t[SomeColumn] = something)

You'll get an error that the column you reference cannot be found. To get this to work, you must use FILTER (notice that the code uses [SomeColumn] instead of the fully qualified column name _t[SomeColumn]).

VAR _t = ADDCOLUMNS(SUMMARIZE(…), "SomeColumn", <some expression>)
RETURN
COUNTROWS(FILTER(_t, [SomeColumn] = something))

Why doesn't CALCULATE work? CALCULATE changes the filter context, but the DAX filter context only allows columns in the model. It doesn't allow "extension" columns created in a DAX expression. [SomeColumn] exists only in the DAX expression, not in the model, and therefore is not visible to CALCULATE.

On the other hand, FILTER resolves [SomeColumn] in row context. [SomeColumn] is bound to the column in the table expression _t, so it can be resolved.

Speaking of grouping, you might have noticed that Power BI Desktop uses SUMMARIZECOLUMNS in autogenerated DAX queries instead of ADDCOLUMNS(SUMMARIZE()). However, if you attempt to use SUMMARIZECOLUMNS in a measure, you’ll get an error “SummarizeColumns() and AddMissingItems() may not be used in this context”. How come?

SUMMARIZECOLUMNS is not supported in measures because it was designed for resultset-producing, autogenerated DAX queries. It wasn't enabled for measures because it has different semantics when filters exist in the filter context (as in the measure scenario) than when filters are passed to it as arguments (as in the DAX query scenario). So, if you try to use SUMMARIZECOLUMNS in a measure to replicate the query results of a visual, you may not get the expected results.

Data Integration Options with Cloud Providers

It's hard to envision a data analytics solution that doesn't have to ingest data from some cloud vendor. I previously ranted here about cloud prohibitors – cloud providers that negate the benefits of moving to the cloud by enforcing all sorts of hoops to get to your data. This blog summarizes a few data integration options with cloud vendors, from best to worst.

  1. Direct access to the database – This is by far your best option. If the provider allows you to access the data by directly connecting to its native storage, you should take it even if it involves higher fees, such as upgrading to a "premium" tier (a travesty considering that you didn't have to pay higher fees with their on-prem installations). There are no insurmountable engineering barriers for the vendor to provide direct access to the data in its native storage, but very few cloud vendors do it, citing "issues" such as security, impact on operational processes, and so on. Here is a free piece of advice to cloud vendors: provide direct access to the underlying relational database and use this as a big differentiator against the competition. Security and performance should be on you, not the customer.
  2. Data staging – If direct access is not an option, the second best would be for the vendor to stage the data out, ideally to a relational database you provision, such as by using the Dynamics Data Export add-in to export data from Dynamics Online (now regretfully deprecated by Microsoft in favor of staging to a data lake). Unfortunately, it looks like the norm nowadays is to export to a data lake as flat files, so the extraction path becomes relational database -> flat files -> relational database. I hope that makes sense to you, because it doesn't to me. Things to watch out for here are where the data lake is located (Azure, AWS, others) and what integration options are provided. For example, one major ERP vendor supports only S3 as a data lake and provides only a JDBC driver to connect to it (JDBC is a Java-based connector that can't be used by any Microsoft-based integration tool).
  3. Data push as flat files – Smaller and more flexible cloud providers might be willing to push the data to a storage location that you provide. In such cases, you should strongly consider an Azure data lake instead of FTP. For example, one cloud provider I integrated with used Ruby to call the Azure API and post data extracts to the client's data lake storage that we set up.
  4. REST APIs – Most cloud integration scenarios will fall into this bucket so that the vendor ensures "secure" and "controlled" access to data. The reality is that many of these REST APIs are horrible in both implementation and throughput. For example, the REST APIs of another ERP vendor couldn't handle a batch export of 500,000 rows (a dataset that can fit into an Excel spreadsheet and be emailed around, mind you). Their server would time out because of the "massive" data, and the client was asked to use callback APIs to chunk the export, which of course didn't work either. And complexity only grows from here, as some implementations require result paging, error handling, etc., forcing you to write custom code (see the sketch after this list).
  5. Manual export – If the above options are not possible, your last resort might be to use the vendor app and export the data manually. So much for automating the data integration…
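
To give an idea of the custom code that point 4 can require, here is a minimal Power Query sketch of calling a paged REST API. The endpoint, the page/pageSize query parameters, and the JSON shape are hypothetical rather than any specific vendor's API, and a real implementation would also need authentication and error handling:

let
    // Fetch one page of results; the items field is an assumption about the response shape
    GetPage = (page as number) as list =>
        let
            Response = Json.Document(
                Web.Contents(
                    "https://api.example.com/v1/orders",
                    [Query = [page = Number.ToText(page), pageSize = "1000"]]
                )
            )
        in
            Response[items],
    // Keep requesting pages until the API returns an empty page
    Pages = List.Generate(
        () => [page = 1, items = GetPage(1)],
        each List.Count([items]) > 0,
        each [page = [page] + 1, items = GetPage([page] + 1)],
        each [items]
    ),
    Result = Table.FromRecords(List.Combine(Pages))
in
    Result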

In summary, when evaluating a cloud vendor, don't forget to ask how you can access your data. Extracting data by connecting to its native storage (typically a relational database) is the best option for integrating with cloud data sources: the data is in its native format, you can fold operations such as filtering and sorting, you can extract data incrementally, and you can even build SQL views if the vendor allows it. Unfortunately, the norm nowadays is to force you to call the vendor's API, so the extraction effort in your ETL is on you and not the vendor.

Atlanta MS BI and Power BI Group Meeting on March 7th

Please join us online for the next Atlanta MS BI and Power BI Group meeting on Monday, March 7th, at 6:30 PM ET. Elayne Jones (Consultant with BlueGranite) will cover the pros and cons of migrating semantic models from Azure Analysis Services to Power BI. For more details and to sign up, visit our group page.

Presentation: Migrating Azure Analysis Services Models to PBI Premium
Date: March 7th
Time: 6:30 – 8:30 PM ET
Place: Click here to join the meeting
Overview: Migrating Azure Analysis Services models to Power BI Premium simplifies resource maintenance and drives efficient business intelligence environments. Here, Elayne Jones covers the benefits, drawbacks, prerequisites, and available deployment options.
Speaker: Elayne Jones is a Staff Consultant at BlueGranite with experience in data analysis, visualization, and troubleshooting in the transportation and shipping industries. She graduated summa cum laude from the University of North Georgia and holds multiple Microsoft certifications. Elayne is passionate about harnessing the transformative power of data to drive insights for businesses.


How Power BI Broke Amazon

Amazon started in 1994 as an online marketplace for books. You'd think that after being in this business for almost 30 years, they've figured out book ecommerce inside out. Unfortunately, rapid revisions (I revise my Power BI book annually) break their model. Every new revision wreaks havoc on the customer experience of purchasing the book and punishes the publisher. Here is what I've learned after having my books sold on Amazon for many years.

  1. Amazon has a proprietary algorithm that decides what to show on the book detail page. My books are available in two formats: paperback and ebook (Kindle). When a customer lands on the book page, the natural expectation is that the Kindle tab would lead to the ebook format and the Paperback tab would lead to the paperback format of the same edition. Unfortunately, the publisher has no control over what the Paperback tab eventually links to. For example, in the screenshot below, today the algorithm has decided that the customer would be more interested in purchasing the 5th edition paperback (the 7th edition is currently the latest) for the modest price of $79.82 because it's vintage. Not to mention that the 5th edition has long been unpublished and is only available from third-party sellers, which should've been listed on the "Other Sellers" tab.
  2. Every time a new revision is published, the book fights with itself. Despite the 7-year overall history and 75 reviews of my Power BI book, a new revision goes all the way to the back of the line, and its sales ranking starts from zero. Amazon doesn't give any credit to prior revisions or reviews. And by the time the book has enough history to bubble up in the search results, it's time for a new revision, and the cycle repeats itself.

I wrote this post mostly as an apology to my customers, since there is nothing that can be done to improve their Amazon book purchasing experience. The book is at the mercy of Amazon and its obsolete algorithms.

Power BI Dynamic M Query Parameters Reloaded

In a previous blog from October 2020, I ranted about the narrow list of data sources supported by Power BI dynamic M query parameters. The February 2022 update of Power BI Desktop expanded the feature's reach to relational data sources, such as SQL Server (all flavors), Oracle, Teradata, and SAP. This warranted a second look to see where we are now and where we're going.

Dynamic queries are now possible against popular relational data sources

The extended support makes it possible to execute SQL queries or stored procedures by passing parameters based on the user selection. I attach a sample pbix file that demonstrates this scenario. It calls a stored procedure uspGetManagerEmployeesTL in the AdventureWorks2012 database. The stored procedure is a slightly enhanced version of the original uspGetManagerEmployees sp included with AdventureWorks, because I wanted to experiment with different scenarios around data security. The idea is to execute the stored procedure after each EmailAddress slicer change to see the employees reporting to that manager.

The following needs to happen to get this feature to work:

  1. You must configure the stored procedure to execute in DirectQuery mode when setting up the connection (the Employee table that I use for the slicer could be imported).
  2. In the case of calling a parameterized stored procedure in DirectQuery, you must wrap the call with OPENQUERY or OPENROWSET (the latter won't work for Azure SQL Database) to avoid Power BI choking when running the report (exec <storedprocedure> <parameter> works in Power Query but not when the report renders), as explained in more detail in my blog "Power BI DirectQuery with Parameterized Stored Procedure".
  3. You must create a query parameter (Login parameter in my case).
  4. In the Model view, you must map the filtered field (in this case EmailAddress in the Employee table) to the Login parameter in the Advanced Properties.
  5. In Power Query, modify the stored procedure call to use the Login parameter.
 Source = Sql.Database("xps", "AdventureWorks2012", [Query="SELECT * FROM #(lf)OPENROWSET('SQLNCLI','trusted_connection=yes', 'exec AdventureWorks2012..uspGetManagerEmployeesTL ''" & Login & "''')", CreateNavigationProperties=false])

How about dynamically passing user context or DAX measure output?

If you read the original blog, I was after passing the user identity obtained from USERPRINCIPALNAME() as a parameter to the stored procedure to support the scenario where the client already has database security in place. Unfortunately, we still can't meet this scenario because of the other limitations of dynamic query parameters listed in the documentation. Specifically:

  1. Currently, there is no way to pass the results of DAX measures to Power Query. You might get innovative and try filtering the slicer using a DAX measure, such as the one below, but it won't work because it uses an "advanced" filter.
    EmployeeFilter = IF(SELECTEDVALUE(Employee[EmailAddress]) = USERPRINCIPALNAME(),1)
  2. The filtered field must be the only field in the filter context. For example, you can't use cross-highlighting from another visual that lists another employee attribute, such as FullName, even if it maps 1:1 to the filtered field. The classic example where this could be useful would be to show the user-friendly name in a slicer but pass the system key to the database. Sorry, we're not that advanced yet, although SSRS could do this 20 years ago…

In summary, you can now execute parameterized dynamic SQL statements or stored procedures against traditional relational data sources in DirectQuery by letting the user select the parameter value mapped to a dynamic query parameter. However, existing limitations prevent more flexible integration scenarios, such as passing DAX measures to the data source. Unfortunately, Power Query and the model continue to act as two tenants in common that don't care much about each other's whereabouts.