Understanding Power BI Dual Storage
The July release of Power BI Desktop introduced composite models, which I wrote about in my blog “Power BI Composite Models: The Good, The Bad, The Ugly” Composite models make data acquisition much more flexible but also more complex. A Power BI table can now be configured in one of these storage modes:
- Import – The table caches the data. All Power BI features work.
- DirectQuery – The table has only the metadata while the data is left at the data source. When users interact with the report, Power BI generates DAX queries to the Analysis Services server hosting the model, which in turn generates native queries to the data source. DirectQuery supports limited Power Query transformations and DAX functions in calculated columns, as described in more details here. Besides Power Query and DAX limitations, DirectQuery and Live Connections have feature limitations, as I discussed in my blog “Power BI Feature Discrepancies for Data Acquisition“.
- Live Connection – This is a special case when Power BI connects directly to multidimensional data sources, such as Analysis Services, published Power BI Datasets and SAP Hana. The fact that Power BI doesn’t support DirectQuery against these data sources is the reason why composite models are not supported as well.
- Dual – This is a new storage mode that deserves more attention.
As the name implies, the dual storage mode is a hybrid between Import and DirectQuery. Like importing data, the dual storage mode caches the data in the table. However, it leaves it up to Power BI to determine the best way to query the table depending on the query context. Consider the following schema where all tables come from the same data source. Let’s assume the following configuration:
- FactInternetSales is imported
- FactResellerSales is DirectQuery
- DimDate and DimProduct are Dual
Power BI will attempt the most efficient was to resolve the table joins. For example, if a query involves FactInternetSales and one or more of the dual dimensional tables, such as DimDate, the query would involve the DimDate cache. However, if the query involves FactResellerSales, Power BI will pass through the join. That’s because it would much more efficient to let the data source join the two tables in DirectQuery as opposed to bringing all the FactResellerSales table at the join granularity and then joining to the DimDate cache. In fact, Power BI will force you to change related dimension tables to Dual when it discovers hybrid joins among tables in the same data source. The following storage mode configurations are allowed between any two tables participating in M:1 join from a single data source:
Table on Many Side Is In | Table on One Side Must Be In |
Dual | Dual |
Import | Import or Dual |
DirectQuery | DirectQuery or Dual |
Therefore, a conformed dimension table must be configured for a Dual mode to join a fact table in Import mode and a fact table in DirectQuery mode.
When Power BI Desktop detects an incompatible configuration, it may disallow the change. It may disallow the configuration because currently Power BI can’t switch from Import to any other storage mode.
Or, it may ask you to switch storage modes it if finds a way to perform the change on its own.
Dual mode is important for tables coming from a single data source. Heterogeneous joins spanning two data sources would always result in a M:M relationship where the Power BI Mashup Engine will resolve the join. Specifically, when the joined table cardinality exceeds an undisclosed by Microsoft threshold, Power BI would perform the join on its own. For example, if DimDate is imported from Data Source A but FactResellerSales is DirectQuery in Data Source B, Power BI would send a direct query to group FactResellerSales to B at the join cardinality (e.g. DateKey), get the data, and then join the tables internally.
On my question why Power BI doesn’t handle the dual storage mode on its own by selecting the most efficient join, Microsoft replied that there are two reasons to delegate this task to the modeler and to make the Dual storage configuration explicit:
- Like Import, Dual requires refresh, whereas DirectQuery doesn’t.
- Apart from being able to revert to DirectQuery mode, Dual is subject to the same restrictions as DirectQuery.
Therefore, the modeler needs to be aware that the “switch” may result in requiring a data refresh or may result in a data a loss.
The Power BI dual storage mode is important when creating 1:M joins between tables in different storage modes from the same data model and for the forthcoming “Aggregations” feature, which I’ll cover in another blog. Although you need to know how Dual works, you don’t need to remember rules as Power BI Desktop will detect incompatible configurations. Composite models, heterogeneous M:M relationships, dual storage mode, and aggregations are Power BI-only features. At this point, Microsoft hasn’t announced when they will be available in Azure Analysis Services or SSAS Tabular.