Atlanta MS BI and Power BI Group Meeting on November 1st

Please join us online for the next Atlanta MS BI and Power BI Group meeting today (Monday, October 4th), at 6:30 PM ET.  Sandeep Pawar will explain how to use the Power BI AI visuals for predictive insights. And your humble correspondent will show you how to use the Power BI REST APIs.  For more details and sign up, visit our group page.

Presentation:Demystifying Power BI AI Visuals
Date:November 1st
Time:6:30 – 8:30 PM ET
Place:Click here to join the meeting
Overview:Power BI has several powerful AI visuals that allow business analysts to create insightful reports that include predictive capabilities without writing any code. In this session, we will take a deeper look at these visuals, discuss how exactly they work, when & how you should use them effectively and importantly when not to use them. We will look at the algorithms driving them and understand how to use them in your reports. We will look at forecasting, key influencer visual, clustering, decomposition tree, anomaly detector in detail. We will also look at how to validate the outputs of these visuals.
Speaker:Sandeep Pawar is a data science professional. He currently works at Cree Lighting, WI as a Data Analytics engineer. He has experience creating data analytics solutions using BI and ML tools.

PowerBILogo

PolyBase Adventures

I’m setting SQL Server 2019 PolyBase for ODBC to JDBC access to a vendor data lake to virtualize entities as SQL tables. Overall, a smooth experience with a few gotchas:

Data type mappings

The vendor lake uses Oracle data types TIMESTAMP AT TIME ZONE and BOOLEAN that Java doesn’t know how to map. The solution was to set up a view in the data lake (luckily the vendor supports that) to cast these data types to NVARCHAR and INTEGER.

NullPointerException

Once the table is finally set up what do we get when querying it?

105082;Generic ODBC error: java.lang.NullPointerException .

How do we fix this horrible issue? Upgrade SQL Server and PolyBase to the latest cummulative update (CU).

The final mystery that I haven’t been able to crack yet is that for some obscure reason, PolyBase adds quite a bit of performance overhead to the query execution. So, if a query in DBeaver directly connected to the lake (or Power BI Desktop directly corrected to the ODBC driver) takes eight seconds, PolyBase expands it to a minute. Examining the DMVs shows that the actual query does execute in line with DBeaver, but there is some additional overhead from PolyBase that would require a support case with Microsoft.

Atlanta MS BI and Power BI Group Meeting on October 4th

Please join us online for the next Atlanta MS BI and Power BI Group meeting today (Monday, October 4th), at 6:30 PM ET.  Reda Raz (RADACAD) will share best practices on semantic modelling with Power BI. And your humble correspondent will show two new features: Get Insights and Power BI Goals.  For more details and sign up, visit our group page.

Presentation:Power BI Modeling 101
Date:October 4th
Time:6:30 – 8:30 PM ET
Place:Click here to join the meeting
Overview:Getting started with a report in Power BI is easy. However, soon you will face challenges of having multiple tables, the relationship between tables, the direction of relationship, active or inactive relationship and so on. You also soon realize that you need a proper data model called star-schema which is combination of fact and dimension tables. But, wait a second, you never learned all these fundamentals. What should you do? This session is build exactly for you, to help you understand the fundamentals of Power BI modelling and start from a good foundation.
Speaker:Reza Rad is a Microsoft Regional Director, an Author, Trainer, Speaker, and Consultant. He has a BSc in Computer engineering; he has more than 20 years of experience in data analysis, BI, databases, programming, and development mostly on Microsoft technologies. He is a Microsoft Data Platform MVP for 11 continuous years (from 2011 till now) for his dedication in Microsoft BI. Reza is an active blogger and co-founder of RADACAD. Reza is also co-founder and co-organizer of Difinity conference in New Zealand, and the Power BI Summit (the biggest Power BI conference)

PowerBILogo

Drillthrough Paginated Reports

I’m helping a client convert a few SSRS reports from SharePoint to Power BI Premium Per User (PPU). SSRS is of course near and dear to my heart because of all the work I’ve done around it circa 2004-2010 (yep, it’s been that long), books, MVP awards, etc. Since its humble beginnings, SSRS have had a solid architecture that excelled in extensibility. You’d be hardly pressed to face a requirement that couldn’t meet with SSRS back then.

Unfortunately, most of these extensibility features, such as custom assemblies, custom security, custom delivery extensions, custom renderers (essentially everything related to custom code) didn’t make it to paginated reports in Power BI Premium. Not many companies are using these features, so they probably won’t be a showstopper for your migration. To their credit, Microsoft is closing the gap between SSRS and paginated reports. As of now, the feature limits that you might run into are:

Missing FeatureWorkaround
Shared data sources/datasetsReport-specific (embedded) data sources/datasets (yep, a maintenance nightmare)
Drillthrough report actionsChange to URL action and provide URL to drillthrough reports passing filters on the URL
Document mapNo workaround

Instead of drillthrough report actions, implement URL-based actions. Assuming embedding for your organization, here are the high-level steps:

  1. Deploy the drillthrough report to Power BI. Again, you must deploy to a Premium or PPU workspace. Run the report. It should run successfully. Go to File, Embed menu and copy the iframe code (assuming you want to embed the report in your company’s portal). The iframe code should look like this:
    <iframe width="800" height="600" src="https://app.powerbi.com/rdlEmbed?reportId=b82ed928-d9d9-42b5-b6cc-a6e8f3e9dc4d&autoAuth=true&ctid=<your tenant id>" frameborder="0" allowFullScreen="true"></iframe>
  2. Open the main report in Visual Studio or Power BI Report Builder. In the report properties, Code tab, define a public constant to the drillthrough report(s). Or, you can define internal parameters.

  3. Now find all instances of drillthrough actions in the main report. If there are many, it might be easier to open the report RDL in Visual studio and search for <Drillthrough>. For each textbox with drillthrough replace the Drillthrough section in the <ActionInfo> element with a Hyperlink, such as
<ActionInfo>
 <Actions>
 <Action>
 <Hyperlink>=String.Format("{0}&amp;rp:ProfileID={1}&amp;rp:Month1={2}&amp;rp:Month2={3}", Code.reportMigrationDrillthroughUri, Parameters!ProfileID.Value, Parameters!Month1.Value, Parameters!Month2.Value)</Hyperlink>
 </Action>
 </Actions>
</ActionInfo>
  1. In the example above, the positional string replace will replace the {0} placeholder with the report drillthrough constant you defined before. Then, you provide values for the drillthrough report parameters (in this case, the drillthrough report takes three parameters).
  2. Test the main report and drillthrough links. If all is well, publish the main report to Power BI. Obtain its embedded iframe code and add it to your app page.

Use the secure embed iframe URL for the main report so that it’s embedded in the app instead of opening in a new browser tab. Unfortunately, the drillthrough reports will open in a separate tab and I couldn’t find a workaround to render them inside the iframe of the main report.

Chasing SSAS Connection Timeouts

Suppose you have a Tabular model and you send a massive DAX query to it that could run for hours, such as to calculate many measures (in our case hundreds) for each customer overnight so that you can cache the results and delight the user with super-fast lookups. This issue could also apply to Multidimensional although in this case Tabular was used. The server times out sporadically the query after a random execution time. You have changed all possible connection timeout options (SSAS ServerTimeout, SSIS connection timeout, etc.) to no avail. In fact, if you have scheduled an Agent job that calls an SSIS package that executes the query, the package doesn’t register the exception and continues executing indefinitely, but a Profiler trace (or XEvents session) shows that the server raises a Connection Timeout error.

How to fix this horrible issue? Change the two undocumented settings in the MSMDSRV.INI file from 60,000 to 600,000.

<ServerSendTimeout>600000</ServerSendTimeout>

<ServerReceiveTimeout>600000</ServerReceiveTimeout>

Azure Data Factory is Getting Better All the Time

Three years ago, I wrote that it would probably take a decade for to mature and close the gap with SSIS.

To its credit thought, while it’s still lagging in the area of extensibility, ADF added features that we don’t have in SSIS so I’m developing a taste for it:

  1. Schema drift – Suppose you want to automatically stage new columns as they added to a source table. Or, columns might be deleted from the source but your ETL shouldn’t fail. You can’t do these things with SSIS which is tightly coupled with the data source schema. ADF data flows, however, can handle this.
  2. Parallel loops – Want to loop through some tables but load them in parallel? The ADF ForEach loop can be parallelized up to 50 concurrent threads.
  3. Source partitioning – Let’s say you have a big source table and you want to speed up staging. You can configure the Copy Activity to automatically create multiple threads to load the table in parallel, such as by partition (if the source table is already partitioned) or by buckets based on a primary key.
  4. Scalability – As I mentioned in the blog above, this was the main driver for Microsoft to start from scratch with ADF. You’d be hard pressed to run out of resources with ADF, so it wins hands down against SSIS in this area.
  5. Managed VNET on a horizon – Currently in preview, you can ask your helpful System Services to setup a managed virtual network and thus avoid the need to set up self-service runtimes for accessing on-prem data sources. Make sure you install a Linux VM as the tutorial demonstrates since we found the hard way that Windows won’t work.

ADF is not without idiosyncrasies of course. Why would data flows have a separate expression language and require a separate runtime? Speaking of data flows, I almost used them for their automatic file partitioning until I found that I don’t have the control over the partition names to support both full and incremental load. Long live the copy activity!

Atlanta MS BI and Power BI Group Meeting on August 2nd

Please join us online for the next Atlanta MS BI and Power BI Group meeting on Monday, August 2nd, at 6:30 PM.  Avi Singh (LearnPowerBI.com) will discuss how you can achieve a successful Power BI career.  For more details and sign up, visit our group page.

Presentation:How to Create a Successful Power BI Career Without the Struggle (By “Niching Down”)
Date:August 2nd
Time:6:30 – 8:30 PM ET
Place:Click here to join the meeting
Overview:Your Grandma was right! You cannot please everyone. And when you go out there and wave your flag as a Power BI Professional, that’s exactly what you are trying to do. And it doesn’t work. Either you get no results or have to work really hard for every inch of progress. The problem is that most professionals either miss or mess-up the first crucial step of their Power BI Career – Niching Down! In this session we’d talk about:

– Why Niching down is the crucial first step towards a successful Power BI Career

– How you can come up with the Niche that you should focus on

– Address some of the fears you may have about niching down (a classic one is: “Oh, but I don’t want to miss out on all the other opportunities out there by niching down”)

Speaker:If you are searching for a more meaningful work life as a Power BI Consultant ➔ Where you do what you love, create an impact by helping others and create a life of freedom for yourself ➔ Then Avi is dedicated in helping you achieve those goals.

Hint: The classic wisdom of “Learning More” and “Working Hard” actually does NOT help you get to these goals. Avi leads a five-step program for creating successful Power BI Consultants; starting with the first step…you guessed it 😉 “Niching Down”

PowerBILogo

Interactive Excel Pivots in Power BI

A long-standing limitation of Power BI has been that Excel pivot reports connected to external Analysis Services models can’t be interacted with when uploaded to Power BI Service. You get the cached pivot report but any attempt to interact with the report online (that is in Excel Online), such as to change a slicer or filter, would result in an error complaining that the connection can’t be refreshed and there hasn’t been a workaround. Today, Microsoft partially lifted this limitation by supporting interactive pivots connected to Power BI datasets. Unfortunately, pivots connected to external Analysis Services models (both on-prem and cloud) still don’t support interactivity.

For some reason, this feature is called “connected PivotTable refresh” although a better name would be “interactive Excel pivots finally”. It works in Power BI Pro and Premium. Unlike the initial announcement that speculated that this feature would be available by simply uploading your Excel file to Power BI Service, it appears that it requires the Excel file to be uploaded to OneDrive or SharePoint Online. Here are the high-level steps that worked for me:

  1. Create your Excel pivot using the Analyze in Excel feature or directly from Excel. Assuming you have a recent Office 365 build, the latter option allows you to open Excel on the desktop and click Insert->Pivot Table->From Power BI to connect to a Power BI dataset from within Excel Desktop.
  2. Click Save As and save your workbook to OneDrive or SharePoint Online. Recall that you can specify a SharePoint site for storing workspace files in the Power BI workspace settings if you don’t want to upload the file to your personal folder.
  3. Use the Power BI Service “Get data” feature to upload (not import) the Excel workbook in a Power BI workspace. Again, the Excel workbook must be located either in OneDrive or SharePoint online.
  4. Interact with the report in Power BI Service, such by drilling down, drilling through, or even rebinding the report to different fields! As a bonus, when the underlying dataset refreshes, your pivot report will show the latest data as it connects live to the dataset.

The “connected PivotTable refresh” is a much-needed enhancement that brings the same interactive features to Excel pivot reports that users previously had in SharePoint Server. Now users who favor Excel pivots for interactive exploration can share their reports and the interactive features would be preserved when the report is rendered online. The feature requires storing the Excel workbook in OneDrive or SharePoint Online and using an Analyze in Excel connection. Unfortunately, it doesn’t work with Analysis Services models that are external to Power BI (not deployed to Power BI Service) which is another incentive to move your semantic models to Power BI as I’ve been advocating for some time.

Cloud Inhibitors

In a previous blog, I said I consider it a travesty when a vendor’s cloud offering prohibits direct access to the data in its native storage, which is typically a relational database, and therefore a perfect store for data analytics and integration, such as to run real-time analytics or export the data to load a data mart. There is nothing simpler than connecting to a relational database, which is designed to handle efficiently select and filter operations. Unfortunately, it’s becoming a norm where the move to the cloud requires going through all sorts of hoops to access the data. Some enlightened providers offer direct access to the native store by charging a premium fee. However, this is not usually an option and most vendors require you call their REST APIs.

Since, the move to the cloud shouldn’t make things more difficult, I call these vendors “cloud inhibitors”. Such vendors usually cite concerns, such as security and performance, as reasons to prevent access. But the irony is that such issues could be easier addressed by the vendor, such as by hosting the customer on a dedicated VM (think of Power BI Premium), instead of transferring the data integration burden to the customer.

The previous “Cloud Inhibitor” award was held by Microsoft Dynamics Online that has decided to prevent access to the Dynamics SQL database and recommends the customer to consume the staged data from Azure Data Lake CDM folders. One popular ERP vendor has gone even further when embracing AWS as their platform of choice. Not only did they disallow access to their SQL Server database, but they also disallow access to the S3 data lake where they stage the data. Instead, when moving their on-prem ERP solution to the cloud, a client has found that they have two choices for data integration: call the REST APIs or use a JDBC driver, with the latter option out of reach for Microsoft apps.

I’m still researching the best option to enable this client access their data with minimum integration effort although the recommended solution by the vendor was to write a client app to call REST APIs or use Java-based ETL. How ridiculous is this given that that client could simply connect to their ERP database in the past? So much about real-time BI and no-code data integration!

Synapse Serverless: The Good, The Bad, and The Ugly

When you a provision a Synapse workspace, you get a serverless endpoint for free (or almost free). This endpoint represents Synapse Serverless: a query service for ad-hoc exploration of data in CVS, Parquet, and JSON files stored in Azure Data Lake.

The Good

Being able to query files using SQL is great. Some of you might remember the U-SQL language that was introduced a few years ago alongside Azure Data Lake Storage (ADLS) Gen 1 which is now deprecated. It never caught up because it looked like SQL but it wasn’t (it was actually closer to C#). Now we’re talking about real SQL. To query files! This opens the possibility to implement a logical warehouse (the emphasis is on logical as everyone to my knowledge who tried to replace a data warehouse with a data lake has failed). Or, you can connect Power BI to the serverless endpoint and start querying all these files in DirectQuery mode. So, this enables real-time BI on top of file extracts.

Behind the scenes, Synapse Serverless uses a distributed query engine to conquer and process files in parallel. For more information, read the “POLARIS: The Distributed SQL Engine in Azure Synapse” whitepaper. I ran some limited performance testing and I’m impressed. I’m yet to test joins between large tables (I’d love to hear from someone on this subject).

Pricing is based on the data processed but it’s only $5 per TB. So, export your data in *.parquet files which are smaller and faster (TIP: you can use Azure Data Factory to output Parquet files). Furthermore, Serverless will automatically update statistics for Parquet files (CVS files will require you to update statistics manually) which together with proper partitioning could improve performance for joins and filters so that Serverless doesn’t have to read all the files.

The Serverless feature that I like most though is the flexible schema.

Consider the following query, which defines the expected schema using the WITH clause. Will the query break if a schema drift is detected, e.g. some of the files have additional columns or the column names don’t match? Nope. The query still executes successfully, and if not specified in the WITH clause, the schema drift columns will come back empty. This opens the opportunity to do analytics on top of different schemas which is something you can’t do with your trusted SQL Server!

SELECT

     COUNT([Month]), COUNT([Month1])

FROM

    OPENROWSET(

        BULK‘https://dvdls01.dfs.core.windows.net/dvdlsfs01/google/*.csv’,

        FORMAT = ‘CSV’,

        PARSER_VERSION=‘2.0’,

        HEADER_ROW = TRUE

    )

    WITH (

        [Year] int,

        [Month] NVARCHAR(200),

        [Month1] NVARCHAR(200),

        [Day] int,

        [Ecommerce Conversion Rate] NVARCHAR(50),

        [New Users] int

        )

        AS [result]

The Bad

Currently, Serverless is limited to accessing data stored in ADLS only. I wish Microsoft extends Serverless to other vendors, such as Amazon (S3 and Athena) or Google. As a PaaS offering, there is no option to plug in additional drivers. The same limitation applies to the Synapse SQL Pool by the way. Adding the ORC files to the list of supported file formats will be a good addition, as well as the AVRO format which is generated by the Event Hub Capture.

You can find more crowdsourced improvement ideas at the Synapse Forum.

The Ugly

Now that you have a logical data warehouse, a common requirement is to get that data and load it into a physical data warehouse hosted in the Synapse SQL Pool. Unfortunately, the current architecture keeps the two pools separate (think of them as two different SQL servers). Views saved in a Serverless database are not accessible by the SQL Pool. The SQL Pool doesn’t support OPENROWSET either, so the approach outlined here won’t work. You must use Azure Data Factory or another ETL tool to stage the data in the SQL Pool if you follow the ELT pattern. I’d love to see the SQL Pool extended to support OPENROWSET as Azure SQL Managed Instance does to avoid yet another data movement and data staging.