Azure Data Engineer Badge Acquired

Just passed the Microsoft DP-203 exam and added yet another badge to my certification portfolio. This certification is one of the requirements for partners to maintain Microsoft Gold Data Analytics Competency, so I’m glad to take this off my list.

Atlanta MS BI and Power BI Group Meeting on February 7th

Please join us online for the next Atlanta MS BI and Power BI Group meeting on Monday, February 7th, at 6:30 PM ET.  Your humble correspondent will show you different techniques to implement scorecards with Power BI. For more details and sign up, visit our group page. Download the presentation assets here.

Presentation:Implementing Scorecards
Date:February 7th
Time:6:30 – 8:30 PM ET
Place:Click here to join the meeting
Overview:Join this session to learn how to create scorecards measuring strategic objectives with Power BI. We’ll start by introducing you to balanced scorecards and KPIs. Then, I’ll compare different ways to assemble scorecards, including:

·       Conditional formatting

·       Analysis Services KPI

·       Power BI Goals

Demos will make it all clear. As a bonus, you’ll learn some black-belt modeling techniques, such as using Tabular Editor where Power BI Desktop falls short.

Speaker:Teo Lachev is a consultant, author, and mentor, with a focus on Microsoft BI. Through his Atlanta-based company Prologika (a Microsoft Gold Partner in Data Analytics and Data Platform) he designs and implements innovative solutions that bring tremendous value to his clients. Teo has authored and co-authored several books, and he has been leading the Atlanta Microsoft Business Intelligence group since he founded it in 2010. Microsoft has recognized Teo’s contributions to the community by awarding him the prestigious Microsoft Most Valuable Professional (MVP) Data Platform status for 15 years. In 2021, Microsoft selected Teo as one of only 30 FastTrack Solution Architects for Power BI worldwide. https://prologika.com
Prototypes without PizzaPower BI Latest

PowerBILogo

DirectQuery Performance Progression

So much data, so little budget! Caching data in Power BI gives you the best report performance, but budget constraints usually put downward pressure to stay within lower Power BI Premium premium plans. So, what to do?

  1. Split large models. Remember that Power BI Premium Gen2 grants each dataset a 25 GB memory quota. I’m actually a big proponent for consolidated organizational semantic models, but this is where best practices meet reality.
  2. Switch large tables to DirectQuery. When report performance sucks, follow this performance optimization progression:
    1. Add a columnstore index to the fact table – As VertiPaq, a columnstore index organizes data in columns so aggregate queries should see an immediate performance boost. Detail-level queries, e.g. sales by customer, not so much as they probably won’t hit the index.
    2. Try hybrid tables when you can get away with a compromise where the latest data can be cached, but archive data left in DirectQuery.
    3. Try aggregation tables (first automatic, then manual) when you need to speed up aggregate queries at a higher-level grain, such as dashboard queries but leave lower-level queries pass through.
    4. Pray for mercy to come! Or take the stakeholder out for lunch and insist that the time for some compromise has come…

Azure Learning Path

I’m preparing for the new Azure Data Engineer exam, which is one of options to maintain a Gold Certification for Data Analytics for MS partners that has to be renewed annually. I’m really impressed by the Azure Learning Path self-paced environment that Microsoft put together. The training material is top-notch, and I actually learn some practical and useful stuff from the covered topics on ADLS, ADF, Synapse, etc. And you can even practice in an Azure sandbox environment that Microsoft sets up and tears down for you! It’s clear that tremendous effort has gone into setting this online and free learning option. No excuse to not certify anymore …

Check it out the online learning option for exam DP-203 at https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer/.

Applied Power BI Book (7th Edition)

Happy New Year!

I’m excited to announce the availability of the seventh edition of my Applied Microsoft Power BI book! When the first edition was published  in January 2016, it was the first Power BI book at that time and it had less than 300 pages. Since then, I helped many companies adopt or transition to Power BI and taught hundreds of students. It’s been a great experience to witness the momentum surrounding Power BI and how the tool has matured over time. As a result, the book also got thicker and doubled in size. However, I believe what’s more important is that this book provides systematic, yet dependent, view by showing what Power BI can do for four types of users (business users, analysts, pros, and developers).

For the first time, the book features a Kindle PDF (Amazon Print Replica) ebook version which replaces the Kindle native “reflow” format previously used. Now if you buy the ebook, you will read it on your Kindle device or phone exactly as it was formatted to print the paperback! The ebook is also enrolled in Kindle Unlimited – an Amazon subscription program where you can read as many eBooks as you like and keep them as long as you want.

To my understanding, this is the only Power BI book that is revised annually to keep it up to date with this ever changing technology! Hint: you can tell how old Power BI is from the book’s edition number.

Because I had to draw a line somewhere, Applied Microsoft Power BI (7th edition) covers all features that are in preview or released by December 2021. As with my previous books, I’m committed to help my readers with book-related questions and welcome all feedback on the book discussion forum on the book page. While you are there, feel free to check out the book resources (sample chapter, front matter, and more). I also encourage you to follow my blog at https://prologika.com/blog and subscribing to my newsletter at https://prologika.com to stay on the Power BI latest.

Bring your data to life for the seventh time! Keep on reading and learning!

Atlanta MS BI and Power BI Group Meeting on January 3rd

Please join us online for the next Atlanta MS BI and Power BI Group meeting on Monday, January 3rd, at 6:30 PM ET.  Paul Turley will show us how to integrate Power BI with paginated (SSRS) reports. And your humble correspondent will update you on the Power BI latest.  For more details and sign up, visit our group page.

Presentation:Power BI Paginated Reports: The New Old Operational Reporting Platform
Date:January 3rd
Time:6:30 – 8:30 PM ET
Place:Click here to join the meeting
Overview:Power BI Paginated Reports (aka SQL Server Reporting Services) was old but now it’s new again. Available on-premises or in the Power BI service with flexible licensing, you have multiple options to implement operational reports. This session will briefly cover the differences between analytic and operational reports; and help you understand the advantages and trade-offs using Power BI Paginated Reports, Power BI Report Server and SQL Server Reporting Services. Material from our forthcoming book: Paginated Report Recipes.
Speaker:Paul is a Principal Consultant for 3Cloud Solutions (formerly Pragmatic Works), a Mentor and Microsoft Data Platform MVP. He consults, writes, speaks, teaches & blogs about business intelligence and reporting solutions. He works with companies around the world to model data, visualize and deliver critical information to make informed business decisions; using the Microsoft data platform and business analytics tools. He is a Director of the Oregon Data Community PASS chapter & user group, the author and lead author of Professional SQL Server 2016 Reporting Services and 14 other titles from Wrox & Microsoft Press.  Paul is a 2021 FastTrack Recognized Solution Architect and holds several certifications including MCSE for the Data Platform and BI.
Prototypes without PizzaPower BI Latest

PowerBILogo

Power BI Hybrid Tables

A while back a client wanted to avoid importing a large snapshot fact table with loan balances because its memory footprint would require them to upgrade to a higher Power BI premium plan. This of course required leaving the table in DirectQuery mode at the expense of query performance. Luckily, most users would be interested in the latest six months of data. To speed up performance, we opted for aggregations. However, to complicate things further, they had M2M relationships between dimensions and the fact table which Power BI aggregations don’t support. So, we had to roll out our own “aggregation hits” by redirecting DAX measures either to the aggregated table if the as-of date was in the last six months or to the DirectQuery table otherwise.

Seasoned BI pros might recall that Multidimensional supports measure groups with a mixed storage by creating MOLAP and ROLAP partitions within the same table. The recently announced Power BI hybrid tables carry this concept to Tabular models hosted in Power BI Premium or PPU. This enables two new scenarios:

  • Implement real-time “hot” partitions – For best report performance, you can continue importing and refreshing the data periodically, but you can also implement a “hot” partition configured for DirectQuery to show the latest changes. This is the scenario supported by the incremental refresh policy discussed in the announcement.
  • Leave infrequently accessed data in DirectQuery – This is the scenario that I believe would inspire more interest to address the above requirement.

Implementing real-time partitions

The easiest way to implement a real-time partition is by defining an incremental refresh policy. All you must do is check the “Get the latest data in real time with DirectQuery” checkbox. This will add a DirectQuery partition to the end of the partition design created by Power BI. When the scheduled refresh runs, Power BI will refresh the historical partitions as it would normally do. However, all queries that request data after the scheduled refresh date (at the day boundary) will be sent to the data source. Consequently, your model will have new data that is inserted into the table. I suggest you also configure the report pages that show the real-time data for automatic page refresh (in Power BI Desktop, select the page and turn on the “Page refresh” slider) so that the visuals poll for data changes at a predefined cadence.

Leave infrequently accessed data in DirectQuery

Think of this scenario as the opposite of the real-time partitions because the frequently requested data is imported while the historical data is DirectQuery. Because Power BI Desktop doesn’t support custom partitions, you must use another tool, such as Tabular Editor or SSDT, to configure the partitions by connecting it to the published dataset via the XMLA endpoint (or working with a local *.bim file). If you prefer Power BI Desktop, another option could be to create the partitions in a published test or production model, use Power BI Desktop for development, and configure deployment pipelines to propagate the changes and preserve the partition design in the non-development environments. You’d probably need only two partitions:

  • Historical partition – Specify a SQL statement that queries the historical data with a WHERE clause that qualifies rows using a relative date, such as six months before the system date. Change the partition mode to DirectQuery.
  • Current partition – Specify a SQL statement that defines the slice for the frequently used data. Change the partition mode to Import.

Here are the high-level steps to implement the custom partition design with Tabular Editor assuming you would create the partitions in a published dataset. Note that the first step use Power BI Desktop. You can use Tabular Editor for all steps but it’s a more advanced tool so you might want to use something you’re more familiar with.

  1. Start by opening your dataset in Power BI Desktop. For best performance, drop all imported dimension tables that relate to the large fact table, reimport them in DirectQuery, and then change their storage mode to Dual. That’s because currently Power BI Desktop doesn’t let you switch from Import to DirectQuery and there is no workaround.
  2. Open the Power Query Editor. By default, every table will have one partition and if you used Power BI Desktop, the partition will be an M partition (meaning it will have a Power Query). Change the Source step of the first partition to use a custom query (click the gear icon next to the Source step and then enter your custom SQL statement as per you requirements. For example, if you want the default partition to import the last six months and you target SQL Server, the Source step might look like this:
    let
    Source = Sql.Database("XPS", "AdventureWorksDW2012", [Query="select * from FactResellerSales where OrderDate >= DATEADD(month, -6, GETDATE())"])
    in Source
  3. Publish the dataset to Power BI Service. Close Power BI Desktop and promise yourself never to use it again for that dataset (if you can’t live without it, see the aforementioned note about deployment pipelines). That’s because if you republish the model, PBI Desktop will nuke the custom partitions in the deployed dataset (nice work here Microsoft). Remember that the only partition design supported and preserved by Power BI Desktop is the incremental refresh policy but you can’t use incremental refresh for this scenario.
  4. Connect the Tabular Editor to the published dataset XMLA endpoint (File, Open, From DB). As a prerequisite, make sure to enable the XMLA Endpont for Read/Write in the capacity settings.
  5. Rename the first partition, e.g. ResellerSales-Import.
  6. Duplicate the partition and rename the second one ResellerSales-DQ. Change its Mode property to DirectQuery. Change its source query to slice the historical data.
    let
    Source = Sql.Database("XPS", "AdventureWorksDW2012", [Query="select * from FactResellerSales where OrderDate < DATEADD(month, -6, GETDATE())"])
    in Source
  7. Save (or deploy) the changes to Power BI Service and refresh the dataset.
  8. (Optional) Open SQL Server Profiler connected to the published dataset. User Power BI Desktop or Analyze in Excel to create a report that queries the fact table by date. Notice that when the date filter falls within the last six months, there are no events in the profiler because the query is answered by the Tabular cache. However, when the date filter is outside that period, the profiler shows a SQL SELECT statement because that data is left in the data source to reduce the model’s memory footprint at the expense of performance.

This is what your partition design should look like:

Hybrid tables go even further than composite models by allowing you to mix storage modes within a table and having partitions in Import or DirectQuery storage modes. They enable two scenarios: real-time “hot” partitions and leaving infrequently accessed data in DirectQuery data. What I would like to see Microsoft improve in future is a) let us switch from Import to DirectQuery although this might break calculated columns and Power Query transforms, b) extend Power BI Desktop to support custom partitions so that we can connect Tabular Editor as an external tool and change the partition design and save it in the pbix file, and c) extend Power BI Desktop to preserve the partition design on deploy.

Atlanta MS BI and Power BI Group Meeting on December 6th

Please join us online for the next Atlanta MS BI and Power BI Group meeting on Monday, December 6th, at 6:30 PM ET.  To finish the year at a high note, the famous Guys in the Cube (Patrick and Adam) will tell us how keep the data fresh in Power BI. And your humble correspondent will update you on the Power BI latest.  For more details and sign up, visit our group page.

Presentation:Keeping your Data Fresh in Power BI
Date:December 6th
Time:6:30 – 8:30 PM ET
Place:Click here to join the meeting
Overview:We all want our data refresh to happen quickly so the most current data is available for our reports. In this session we will walk you through options to configure refreshing your data but more importantly we will help with performance. We’ll look at how to identify bottlenecks and then how to optimize at different points to get the most out of your Power BI refresh.
Speaker:Patrick LeBlanc is a currently a Principal Program Manager at Microsoft and a contributing partner to Guy in a Cube. Along with his 15+ years’ experience in IT he holds a Masters of Science degree from Louisiana State University. He is the author and co-author of five SQL Server books. Prior to joining Microsoft, he was awarded Microsoft MVP award for his contributions to the community. Patrick is a regular speaker at many SQL Server Conferences and Community events.

Adam Saxton is just a guy in a cube doing the work! He is on the Power BI CAT team at Microsoft working with customers to help adopt Power BI. He is based in Texas and started with Microsoft supporting SQL Server connectivity and Reporting Services in 2005.

Prototypes without PizzaPower BI Latest

PowerBILogo

Power BI Bookmark Navigator – A Better Hack

As a report author, you are constantly pressed to fit more visuals into a single page. The November release of Power BI Desktop introduced the Power BI Bookmark Navigator, which simplifies the process of creating a tabbed interface, such as this one.

Since Power BI doesn’t support visual containers or a “menu” visual, you must resort to the awful hack of hiding and showing UX elements by bookmarking them. This reminds me of the beginning of my career as a developer where we didn’t have widgets and we had to hack our way through implementing a custom navigation “experience” by toggling visibility. Alas, this continues in the 21st century but at least the hack got simplified. To implement the tabbed interface:

  1. Add two (or more) overlapping visuals.
  2. Add two bookmarks (Bar Chart and Column Chart) that show and hide the appropriate visual. Don’t worry about hidden visuals impacting the report performance because Power BI doesn’t process them.
  3. Add the two bookmarks to a Tabbed Interface bookmark group.
  4. In Report View, go to the Insert ribbon, expand the Buttons menu, and then click Navigators, “Bookmark navigator”.

Currently, Power BI supports two navigators. The “Page navigator” adds a tabbed navigation menu with a tab for each report page to let the user navigate to a given page by clicking the corresponding tab. The navigator that will inspire more interest is the “Bookmark navigator”.

  1. Notice that by default the navigator adds a tab for each bookmark defined in the report, but in this case, you just need to restrict it to the two bookmarks that you previously created. With the navigator selected, expand the Bookmarks section in the “Format navigator” pane, and select the “Tabbed Interface” bookmark group.
  2. Position the navigator above the two visuals. Remember that in Power BI Desktop, you need to press Ctrl when you click that navigator tabs to switch between the visuals.

Limitations and bugs:

  1. The previously selected tab gets stuck in a highlighted state, so you must hover on it to make it appear “unselected”.
  2. Hierarchical navigation is not supported. For example, you might want to build a page navigation experience like in Power BI apps. However, you can’t define a hierarchy, such as to start the user at the bookmark group level and then drill down to bookmarks.
  3. Although you can somewhat customize the tab appearance, no UX designer will probably be impressed. For example, one feature that could be useful to free up more page real estate is to be able to toggle the navigator visibility.

“Serverless” Lessons Learned

I’ve architected and currently implementing a solution that uses Synapse (my last newsletter has the details, plus the architecture diagram). Synapse Serverless is the Microsoft answer to Amazon Athena but instead of using open-source tools like Presto, it’s built on SQL Server. In this project we extract many tables from 1,500 on-prem SQL Server databases and stage them in ADLS.

From there we use Synapse Serverless to virtualize these files as tables that we query with T-SQL to load the source “table” data into a data warehouse hosted in Synapse SQL Pool. I have to tell you that I’m becoming a “serverless” fan.

Here are a few lessons learned from this project:

  1. Save the files in parquet format in ADLS. Parquet can be compressed. It’s columnar based, it’s much faster to query. Serverless automatically creates statistics for parquet files on the first query and each time it detects changes.
  2. Less files result in better ETL performance – We compared the results of querying a virtual table that is based on 1,500 files (one file per database) vs. a single file (by sending a T-SQL SELECT…UNION ALL SELECT query) that combines the data from all databases for that table. The single file outperforms the many files by far. First, the ETL process is a way faster because ADF doesn’t have to queue each copy activity. So, even if the file is small and takes a few seconds to copy over, time quickly adds up so you might find that you have to scale up your ADF self-hosted runtime and increase parallelism in ADF loops. For example, uploading all these files would take an hour vs. 40 seconds for a single file.
  3. Less files results in better query performance – We observed similar results when querying a virtual table in Synapse Serverless. In the case where the table was virtualized on top of many files, it took about 15 seconds to count the rows in the table and even longer to execute a single WHERE clause. By contrast, a virtual table on top of a single file was almost instantaneous.
  4. Don’t be afraid of schema differences – The chances are that different databases may have slightly different schemas, such as data types mismatch or extra columns exist in some tables. A great feature of Synapse Serverless is that the columns of the virtual table are the superset of all possible columns in the source. If a file doesn’t include a column, an empty column is returned.

To make my joy complete, I hope at some point Microsoft would support native integration between SQL Pool and Serverless so we don’t copy the data over. Although SQL Server-based, currently SQL Pool and Serverless are two separate sources. In our case we had to use ADF to extract data from Synapse Serverless and stage it in the SQL Pool before the final transformation to the data warehouse.