Microsoft Dataverse: A Verse Without Rhymes

The cloud is supposed to make things easier, right? Well, not necessarily, as a client and I recently discovered. They use Dynamics 365 Online and they're facing long refresh times for Power BI datasets that import data from Dynamics. Dynamics stores its data in an Azure SQL Database, which now goes by the name Dataverse (previously known as Common Data Service for Apps, or CDS-A). I wrote two years ago about the pros and cons of CDS-A. The "ugly" award back then went to getting the data out of Dynamics. I wrote: "For years people were complaining that after migrating from the on-premises Dynamics to the cloud, they lost the ability to connect to its database directly and they had to rely on the REST APIs (slow) or Data Export Service to export the data to a SQL Server database (fast but requires additional effort and budget)."

Have things changed in the past two years? Yes, but not necessarily for the better. Microsoft now has three ways of getting the data out of Dynamics (staging to an Azure SQL Database via the Data Export Service has been deprecated).

| Option | Pros | Cons |
| --- | --- | --- |
| Connect to the Dynamics REST APIs | Ability to pass predicates on the REST API call, such as column lists, joins, and filters (see the sketch below) | Slow, no query folding |
| Staging to ADLS | Microsoft's preferred approach, automatic data synchronization | No query folding, no DirectQuery, not enough compute resources |
| TDS endpoint | Import and DirectQuery modes, query folding works | Not enough compute resources, not a strategic option |
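For context on the first option, here is roughly what a hand-crafted call against the Dynamics Web API looks like from Power Query. This is a minimal M sketch with a hypothetical org URL and entity; because the connector doesn't fold queries, the predicates ($select, $filter) have to be baked into the OData URL yourself:

```m
let
    // Hypothetical Dynamics 365 environment URL; replace with your org.
    // The column list and filter are passed explicitly on the URL because
    // Power Query will not fold them into the REST call for you.
    Source = OData.Feed(
        "https://yourorg.crm.dynamics.com/api/data/v9.1/accounts" &
        "?$select=name,revenue&$filter=revenue gt 100000",
        null,
        [Implementation = "2.0"]
    )
in
    Source
```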

Let’s take a more in-depth look at the new options: staging to ADLS and the TDS endpoint.

Staging to ADLS

Microsoft invests heavily in, and recommends, staging the Dynamics data out to Azure Data Lake Storage Gen 2 (ADLS). Once you set up your data lake, you go to the Power Platform Admin Center and specify which entities you want to stage out. Once the initial snapshot completes, changes to Dynamics entities are automatically propagated in near real time. From there, you can use the Power BI “Azure Data Lake Storage Gen 2” connector (make sure to expand the “cdm” folder) to import the data. Microsoft explains the process in more detail in this video. What a clumsy solution though! From a perfect store (SQL Server) that supports querying, joining, query folding, etc., we stage the data out as flat files and lose all of that. You would think that Power BI refreshes would be blazingly fast. After all, we’re reading from a bunch of CSV files. Unfortunately, when I ran a test to import an entity with around 1 million rows, I had to cancel the refresh after one hour. By comparison, the original query that used the REST API clocked in at 30 minutes. How come?

The moment you stage data to flat files, you embrace a very inefficient way of consuming it. Since there is no “server”, Power Query can’t pass predicates to the data source. So, if you have an entity with a million rows and you need only a few columns and a subset of rows, Power Query must load all the data from the data lake before it applies the predicates.
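A minimal M sketch of the ADLS import shows where the work happens (hypothetical storage account, container, file, and column names). The filter and column selection run in the Power Query engine only after everything has been downloaded, because nothing folds to flat files:

```m
let
    // Hypothetical CDM folder produced by the Dynamics-to-ADLS export
    Files = AzureStorage.DataLake("https://youraccount.dfs.core.windows.net/yourcontainer/cdm"),
    // Grab the staged CSV snapshot for the entity
    AccountCsv = Files{[Name = "account.csv"]}[Content],
    Parsed = Csv.Document(AccountCsv, [Delimiter = ",", Encoding = 65001]),
    // Neither step folds: all one million rows are downloaded and parsed
    // before Power Query trims them locally
    FewColumns = Table.SelectColumns(Parsed, {"Column1", "Column2"}),
    SubsetRows = Table.SelectRows(FewColumns, each [Column1] <> null)
in
    SubsetRows
```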

As we’ve found, ADLS will not cut through this much data because it has no compute of its own, and Power Query itself doesn’t bring enough compute for large datasets, joins, etc. Microsoft’s solution? Throw in more compute by using the Synapse SQL on-demand (serverless) connector, so the architecture now becomes Dynamics -> ADLS -> Synapse (OD) -> Power BI! Of course, this entails an additional investment on your part in the “compute” layer, but that’s the price to pay when you want your data fast, right?
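With Synapse in the picture, Power BI connects to the serverless SQL endpoint rather than to the lake directly, along these lines (hypothetical workspace name, storage paths, and column mappings; the OPENROWSET pattern is how serverless SQL reads CSV files in place, and this time the predicates are evaluated by Synapse’s compute):

```m
let
    // Hypothetical Synapse serverless (on-demand) SQL endpoint
    Source = Sql.Database(
        "yourworkspace-ondemand.sql.azuresynapse.net",
        "master",
        [Query = "
            SELECT name, revenue
            FROM OPENROWSET(
                BULK 'https://youraccount.dfs.core.windows.net/yourcontainer/cdm/account/*.csv',
                FORMAT = 'CSV', PARSER_VERSION = '2.0'
            ) WITH (name VARCHAR(200), revenue MONEY) AS account
            -- The filter runs on Synapse's compute, not in Power Query
            WHERE revenue > 100000"]
    )
in
    Source
```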

Consider ADLS staging only for small entities unless you want to invest in Synapse. I personally decided against pursuing this path as it’s overkill for the simple request to get access to the data.

TDS Endpoint

I rejoiced when I recently learned that we now have a TDS endpoint to Dataverse. TDS (Tabular Data Stream) is the native protocol that SQL Server uses to communicate with client applications. The TDS endpoint is essentially a wrapper on top of the Dataverse Azure SQL Database, and Power BI has a Dataverse connector that supports both import and DirectQuery modes. Time to celebrate, right? This is exactly what we wanted, since now we don’t have to worry about staging the data out and running out of compute thanks to the horribly inefficient way of loading data from ADLS. We have real-time access to the data, and Power Query can pass predicates to SQL Server.
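Connecting couldn’t be simpler. Here is a minimal M sketch with a hypothetical org URL (I’m showing the CommonDataService.Database function generated by the Common Data Service connector; the navigation step may look slightly different in the newer Dataverse connector):

```m
let
    // Hypothetical environment URL; the connector talks to the TDS endpoint,
    // so Dataverse tables show up under the dbo schema like a regular SQL database
    Source = CommonDataService.Database("yourorg.crm.dynamics.com"),
    Accounts = Source{[Schema = "dbo", Item = "account"]}[Data],
    // Unlike the ADLS path, these predicates fold to the server as T-SQL
    FewColumns = Table.SelectColumns(Accounts, {"name", "revenue"}),
    SubsetRows = Table.SelectRows(FewColumns, each [revenue] > 100000)
in
    SubsetRows
```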

Unfortunately, no luck importing that entity, as the refresh timed out. As it turns out, the TDS endpoint is meant for real-time reporting over relatively small datasets (with no definition from Microsoft of “small”). With large datasets and joins involved, queries require even more compute and processing time. So, back to square one: we don’t have enough compute here either, and there is no workaround. And Microsoft doesn’t plan to invest in the TDS endpoint since staging to ADLS is its preferred solution.

I fail to understand what prevents Microsoft from assigning more “compute” to the TDS endpoint. Surely, customers spending millions in licensing fees deserve an option for a dedicated environment with enough compute, like Power BI offers with premium capacities. If the concern is that direct access can impact production loads, this is something the client should worry about and address, such as by connecting to a read-only replica of the Azure SQL Database, which I’m sure is already available behind the scenes.

I always consider it a travesty when SaaS apps don’t give you access to the data in its native storage, which is usually a relational database and therefore a perfect store. Join my quest to persuade Microsoft to provide direct access, with enough compute, to your data in the cloud, just as you have with their on-premises offerings.