Prologika Business Intelligence Consulting and Training in Atlanta 2024-12-14T15:55:02Z https://prologika.com/feed/atom/ WordPress Prologika - Teo Lachev <![CDATA[Prologika Newsletter Winter 2024]]> https://prologika.com/?p=9337 2024-12-14T15:55:02Z 2024-12-14T15:53:28Z computer memory with queries executing and Microsoft Fabric logo. Image 4 of 4

I conducted recently an assessment for a client facing memory pressure in Power BI Premium. You know these pesky out of memory errors when refreshing a biggish dataset. They started with P1, moved to P2, and now are on P3 but still more memory is needed to satisfy the memory appetite of full refresh. The runtime memory footprint of the problematic semantic model with imported data is 45 GB and they’ve done their best to optimize it. This newsletter outlines a few strategies to tackle excessive memory consumption with large semantic models. Unfortunately, given the current state of Power BI boxed capacities, no option is perfect and at end a compromise will probably be needed somewhere between latency and performance.

Why I don’t like Premium licensing

Since its beginning, Power BI Pro per-user licensing (and later Premium Per User (PPU) licensing) has been very attractive. Many organizations with a limited number of report users flocked to Power BI to save cost. However, organizations with more BI consumers gravitated toward premium licensing where they could have unlimited number of report readers against a fixed monthly fee starting at listed price of $5,000/mo for P1. Sounds like a great deal, right?

I must admit that I detest the premium licensing model because it boxes into certain resource constraints, such as 8 backend cores and 25 GB RAM for P1. There are no custom configurations to let you balance between compute and memory needs. And while there is an auto-scale compute model, it’s very coarse and it applies only to processing cores. The memory constraints are especially problematic given that imported models are memory resident and require more than twice the memory for full refresh. From the outside, these memory constraints seem artificially low to force clients into perpetual upgrades. The new Fabric F capacities that supersede the P plans are even more expensive, justifying the price increase with the added flexibility to pause the capacity which is often impractical.

It looks to me that the premium licensing is pretty good deal for Microsoft. Outgrown 25 GB of RAM in P1? Time to shelve another 5K per month for 25 GB more even if you don’t need more compute power. Meanwhile, the price of 32GB of RAM is less than $100 and falling.

It will be great if at some point Power BI introduces custom capacities. Even better, how about auto-scaling where the capacity resources (both memory and CPU) scale up and down on demand within minutes, such as adding more memory during refresh and reducing the memory when the refresh is over?

Strategies to combat out-of-memory scenarios

So, what should you do if you are strapped for cash? Consider evaluating and adopting one or more of the following memory saving techniques, including:

  • Switching to PPU licensing with a limited number of report users. PPU is equivalent of P3 and grants 100GB RAM per dataset.
  • Optimizing aggressively the model storage when possible, such as removing high-cardinality columns
  • Configuring aggressive incremental refresh policies with polling expressions
  • Moving large fact tables to a separate semantic model (remember that the memory constraints are per dataset and not across all the datasets in the capacity)
  • Implementing DirectQuery features, such as composite models and hybrid tables
  • Switching to a hybrid architecture with on-prem semantic model(s) hosted in SQL Server Analysis Services where you can control the hardware configuration and you’re not charge for more memory.
  • Lobbying Microsoft for much larger memory limits or to bring your own memory (good luck with that but it might be an option if you work for a large and important company)

Considering Direct Lake storage

If Fabric is in your future, one relatively new option to tackle out-of-memory scenarios that deserves to be evaluated and added to the list is semantic models configured for Direct Lake storage. Direct Lake on-demand loading should utilize memory much more efficiently for interactive operations, such as Power BI report execution. This is a bonus to the fact that data Direct Lake models don’t require refresh. Eliminating refresh could save tremendous amount of memory to start with, even if you apply advanced techniques such as incremental refresh or hybrid tables to models with imported data.

I did limited testing to compare performance of import and Direct Lake and posted detailed results in the “Fabric Direct Lake: Memory Utilization with Interactive Operations” blog.

I concluded that if Direct Lake is an option for you, it should be at the forefront of your efforts to combat out-of-memory errors with large datasets.

On the downside, more than likely you’ll have to implement ETL processes to synchronize your data warehouse to a Fabric lakehouse, unless your data is in Fabric to start with, or you use Fabric database mirroring for the currently supported data sources (Azure SQL DB, Cosmos, and Snowflake). I’m not counting the data synchronization time as a downside.


Teo Lachev
Prologika, LLC | Making Sense of Data
logo

]]>
0
Prologika - Teo Lachev <![CDATA[Handling Data Quality and Data Enrichment]]> https://prologika.com/?p=9343 2024-12-07T14:36:49Z 2024-12-07T14:36:49Z Some of the most difficult and tedious issues in almost every BI project are handling data quality issues (aka exceptions) and data enrichment. Examples of data quality issues typically originate from wrong data entries or violated business rules, such as misspelled and misclassified products. Data enrichment tasks may go beyond the design of the original data source, such as introducing new taxonomies for products or customers.

In a recent project, a client needed guidance on where to handle data quality and enrichment processes. Like cakes and ogres, a classic BI solution has layers. Starting upstream in the data pipeline and moving downstream are data sources, ODS (data staging and change data tracking), EDW (star schema) and semantic models.

In general, unless absolutely necessary, I’m against master data management (MDM) systems because of their complexity and cost. Instead, I believe that most data quality and enrichment tasks can be addressed efficiently without formal MDM. I’m also a big believer in tackling these issues as further upstream as possible, ideally in the data source.

Consider the following guidelines:

  1. There are different types of exceptions and enrichment requirements, and they range in complexity and resolution domain. Each case must be triaged by the team to determine how to best handle it by examining the data pipeline upstream to downstream direction and traversing the layers.
  2. As a best practice, exceptions and enrichments tasks should be dealt with as further upstream as possible because the further downstream the correction is made, the narrower its availability to consumers will be and more implementation effort might be required.
    1. Data Source – examples include wrong entries, such as wrong employe’s job start date or adding custom fields to implement categories requires for standard hierarchies. The team should always start with the source system in attempt to address data quality issues at the source.
    2. ODS – examples include corrections not possible in the upstream layers but necessary for other consumers besides EDW.
    3. EDW – examples include corrections applicable only to analytics, such as IsAttendance flag to flag which attendance events should be evaluated.
    4. Semantic layer – calculated columns for data enrichment best done in DAX and additional fields the end user needs in a composite model.
  3. Shortcuts – When resolving the change upstream is time prohibitive, it should be permissible to handle the change further downstream while waiting for changes to an upstream system. For example, if fixing an issue in the data source is rather involved, a temporary fix can be done in ODS to prototype the change and expedite it to consumers. However, the team will remove that fix once the source system is enhanced.
  4. When consumers prefer access to uncorrected data, a pair of fields could be introduced as upstream as possible, such as:
    1. EmployeeId – raw field
    2. EmployeeIdCorrected – corrected field. Joins to the related dimension table will be done on this field.

data quality issues

]]>
0
Prologika - Teo Lachev <![CDATA[LLM Adventures: Microsoft Copilot Studio (The Good, The Bad, and The Ugly)]]> https://prologika.com/?p=9322 2024-11-28T20:30:59Z 2024-11-28T20:30:59Z “A momentary lapse of reason
That binds a life to a life
You won’t regret, you will never forget
There’ll be no sleep in here tonight”
“The Slip”, Pink Floyd

Happy Thanksgiving! What a better way for me to spend a Thanksgiving week than doing more AI? After a year of letting it (and other Microsoft LLM offerings) simmer and awaken by the latest AI hoopla from the Ignite conference, I took another look at Microsoft Copilot Studio. For the uninitiated, Copilot Studio lets you implement AI-powered smart bots (“agents”) for deriving knowledge from documents or websites. Basically, you can view the relationship of Copilot Studio to Retrieval-augmented Generation apps as what Power BI is to self-service BI.

Copilot Studio licensing starts at $200 per month for up to 25,000 messages (interactions between user and agent) although at Ignite Microsoft hinted that pay-as-you-go licensing will be coming.

The Good

A few months ago, when I discussed RAG apps, it was obvious that a lot of custom code had to be written to glue the services together and implement the user interface. Microsoft Copilot Studio has the potential to change and simplify this. It offers a Power Automate-like environment for no-code, low-code implementation of AI agents and therefore opens new possibilities for faster implementation of various and specialized AI agents across the enterprise. I was impressed by how easy the process was and how capable the tool was to create more complex topics, such as conditional branches based on user input.

Like Power BI, the tool gets additional appeal from its integration with the Microsoft ecosystem. For example, it can index SharePoint and OneDrive documents. It can integrate with Power Automate, Azure AI Search and Azure Open AI.

I was impressed by how easy is to use the tool to connect to and intelligently search an existing website. For now, I see this as being its main strength. Organizations can quickly implement agents to help their employees or external users to derive knowledge from intranet or Internet websites.

To demonstrate this, I implemented an agent to index my blog and embedded it below for you to try it out before my free trial expires. Please feel free to ask more sophisticated questions, such as “What’s the author’s sentiment toward Fabric?”, “What are the pros and cons of Fabric?”. Or “I need help with Power BI budget” (I got innovative here and implemented a conditional topic with branches depending on the budget you specify). I instructed the tool to stay only within the content of  my website, so the answers are not diluted from other public sources. Given that no custom code was written, Copilot Studio is pretty impressive.

The Bad

Everyone wants to be autonomous and AI agents are no exception. In fact, “autonomous agent” is the buzzword of AI world today. Not to be outdone, Copilot Studio claims that it can “build agents that operate independently to dynamically plan, learn, and escalate on your behalf”. However, as the tool stands today, I don’t think there is much to this claim. Or it could be that my definition of “autonomous” is different than Microsoft’s.

To me, an autonomous agent must be capable of making decisions and taking actions on its own. Like you tell your assistant that you plan a trip, give her some constraints, such as how much to spend on hotel and air, and let her make travel reservations. As it stands, Copilot Studio offers none of this. It follows a workflow you specify. Again, its output is more or less a smarter bot than the ones you see on many websites.

However, at Ignite Microsoft claimed that autonomy is coming so it will be interesting to see how the tool will evolve. Don’t get me wrong. Even as it stands, I believe the tool has enormous potential for more intelligent search and retrieval of information.

The Ugly

My basic complaint as of now is performance. It took the tool 10 minutes to index a PDF document. Then in a momentary lapse of reason, I connected it to an Azure SQL Database with Adventure Works with 15 tables (the max number of tables currently supported) and it’s still not done indexing after a day. Given that many implementations would require searching the data in relational databases, I believe this is not acceptable. Not to mention there isn’t much insight on how far it’s done indexing or limit the number of fields it should index.

Therefore, I believe most real-world architectures for implementing AI agents will take the path Copilot Studio->Azure AI Search ->Azure Open AI, where Copilot Studio is used for implementing the UI and workflows (topics and actions), while the data indexing is done by Azure AI Search with semantic ranking in conjunction with Azure Open AI for embedded vectors.

 

]]>
0
Prologika - Teo Lachev <![CDATA[Atlanta Microsoft BI Group Meeting on December 2nd (Semantic Modeling as Code)]]> https://prologika.com/?p=9320 2024-11-26T18:51:45Z 2024-11-26T18:51:45Z Atlanta BI fans, please join us online for our next meeting on Monday, December 2nd at 5PM ET (please note the change to our usual meeting time to accommodate the presenter). Rui Romano (Product Manager at Microsoft) will discuss how the new TMDL language for Power BI models can unlock new scenarios that previously weren’t possible. For more details and sign up, visit our group page.

Presentation: “Semantic Modeling as Code” with TMDL using Power BI Desktop Developer Mode (PBIP) and VS Code
Delivery: Online
Level: Intermediate to Advanced

Overview: The landscape for developing enterprise-scale models has never been more exciting than it is now! Developer mode in Power BI Desktop and the new TMDL language unlock new scenarios that previously weren’t possible, such as great source control and co-development experiences with Git integration. Additionally, the TMDL Visual Studio Code extension offers a new, powerful and efficient, code-first semantic modeling experience. Join us to discover the new and powerful ways you can leverage TMDL to accelerate your model development and get a sneak peek into the TMDL roadmap from the Power BI product team.

Speaker: Rui Romano is an experienced Microsoft Professional with a deep passion for data and analytics. He has spent the last decade helping companies make better data-driven decisions and is known for his innovative and practical solutions to complex problems. Currently works as a Product Manager at Microsoft on the Power BI product team, focusing on Pro-BI experiences.

PowerBILogo

]]>
0
Prologika - Teo Lachev <![CDATA[Temporal Tables]]> https://prologika.com/?p=9313 2024-11-10T17:15:43Z 2024-11-10T17:15:43Z I like SQL Server temporal tables for implementing ODS-style tables and change data tracking (CDS) for three main reasons:

  1. SQL Server maintains the system versioning. By contrast, I have witnessed erroneous Start/End dates for pretty much all home grown implementation. Further, SQL Server grains the changes at millisecond level.
  2. There is a clean separation between the current state of data and historical data. SQL Server separates the historical changes to a history table.
  3. You can establish a flexible data retention policy. A retention policy can be established at database or table level. SQL Server take care of purging the expired data.

At the same time, temporal tables are somewhat more difficult to work with. For example, you must disable system versioning before you alter the table. Here is the recommended approach for altering the schema by the documentation:

BEGIN TRANSACTION
ALTER TABLE [dbo].[CompanyLocation] SET (SYSTEM_VERSIONING = OFF);
ALTER TABLE [CompanyLocation] ADD Cntr INT IDENTITY (1, 1);
ALTER TABLE [dbo].[CompanyLocation] SET
(
SYSTEM_VERSIONING = ON
(HISTORY_TABLE = [dbo].[CompanyLocationHistory])
);
COMMIT;

However, if you follow these steps, you will be greeted with the following error when you attempt to restore the system versioning in the third step:

Cannot set SYSTEM_VERSIONING to ON when SYSTEM_TIME period is not defined and the LEDGER=ON option is not specified.

Instead, the following works:

ALTER TABLE <system versioned table>  SET (SYSTEM_VERSIONING = OFF) -- disable system versioning temporarily
-- make schema changes, such as adding new columns
ALTER TABLE <system versioned table>  ADD PERIOD FOR SYSTEM_TIME (StartDate, EndDate); -- restore time period
ALTER TABLE [mulesoft].[Employee] SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = [mulesoft].[EmployeeHistory])) -- restore system versioning

]]>
0
Prologika - Teo Lachev <![CDATA[Atlanta Microsoft BI Group Meeting on November 4th (Accelerating your Fabric Data Estate with AI & Copilot)]]> https://prologika.com/?p=9310 2024-10-29T15:57:30Z 2024-10-29T15:57:30Z Atlanta BI fans, please join us in person for our next meeting on Monday, November 4th at 6:30 PM ET. Stacey Jones (Principal Data & AI Cross-Solution Architect at Microsoft) and Elayne Jones (Solutions Architect at Coca-Cola Bottlers Sales and Services) will explore the AI and Copilot capabilities within Microsoft Fabrics. And I’ll help you catch up on Microsoft BI latest. I will sponsor the event which marks the 14th anniversary of the Atlanta Microsoft BI Group! For more details and sign up, visit our group page.

Details

Presentation: Accelerating your Fabric Data Estate with AI & Copilot
Delivery: In-person
Date: November 4th, 2024
Time: 18:30 – 20:30 ET
Level: Beginner to Intermediate
Food: Pizza and drinks will be provided

Agenda:
18:15-18:30 Registration and networking
18:30-19:00 Organizer and sponsor time (events, Power BI latest, sponsor marketing)
19:00-20:15 Main presentation
20:15-20:30 Q&A

Venue
Improving Office
11675 Rainwater Dr
Suite #100
Alpharetta, GA 30009

Overview: In this presentation, we will explore the groundbreaking AI and Copilot capabilities within Microsoft Fabric, a comprehensive platform designed to enhance productivity and collaboration. By leveraging advanced machine learning algorithms and natural language processing, Microsoft Fabric’s AI/Copilot not only streamlines workflows but also provides intelligent insights and automation, empowering users to achieve more with less effort. Join us as we delve into the features and functionalities that make Microsoft Fabric an indispensable tool for modern enterprises.

Sponsor: CloudStaff.ai

PowerBILogo

]]>
0
Prologika - Teo Lachev <![CDATA[Implementing Role-playing Dimensions in Power BI]]> https://prologika.com/?p=9303 2024-10-11T21:42:50Z 2024-10-11T21:42:50Z Role-playing dimensions are a popular business requirement but yet challenging to implement in Power BI (and Tabular) due to a long-standing limitation that two tables can’t be joined multiple times with active relationships. Declarative relationships are both a blessing and a curse and, in this case, we are confronted with their limitations. Had Power BI allowed multiple relationships, the user must be prompted which path to take. Interestingly, a long time ago Microsoft considered a user interface for the prompting but dropped the idea for unknown reasons.

Given the existing technology limitations, you have two implementation choices for implementing subsequent role-playing dimensions: duplicating the dimension table (either in DW or semantic model) or denormalizing the dimension fields into the fact table. The following table presents pros and cons of each option:

 Option Pros Cons
Duplicate dimension table in semantic model or DW

 

No or minimum impact on ETL

Minimum maintenance in semantic model

All dimension attributes are available

Metadata complexity and confusion

(potentially mitigated with perspectives that will filter metadata for specific subject area)

Denormalizing fields from into fact table Avoid role-playing dimension instances

More intuitive model to business users

Increased fact table size and memory footprint

Impact on ETL

Limited number of dimension attributes

Track visited dimension changes as Type 2 with incremental extraction (while it could be Type 1)

If applicable, inability to reuse the role-playing dimension for another fact table and do cross-fact table analysis

So, which approach should you take? The middle path might make sense. If you need only a limited number of fields for the second role-playing dimension, you could add them to the fact table to avoid another dimension and confusion. For example, if you have a DimEmployee dimension and you need a second instance for the person making the changes to the fact table, you can add the administrator’s full name to the fact table assuming you need only this field from DimEmployee.

By contrast, if you need most of the fields in the role-playing instances, then cloning might make more sense. For example, analyzing fact data by shipped date or due date that requires the established hierarchies in DimDate, could be addressed by cloning DimDate. Then to avoid confusion, consider using Tabular Editor to create perspectives for each subject area where each perspective includes only the role-playing dimensions applicable to that subject area.

Yet, a narrow-case third option exists when you only need role-playing measures, such as SalesAmountByShipDate and SalesAmountByDueDate. This scenario can be addressed by forcing DAX measures to “travel” the inactive relationship by using the USERELATIONSHIP function.

dimensions connected to a fact table in a database

 

]]>
0
Prologika - Teo Lachev <![CDATA[Prologika Newsletter Fall 2024]]> https://prologika.com/?p=9296 2024-09-13T18:06:32Z 2024-09-13T18:06:32Z Text to SQL copilot When it comes to Generative AI and Large Language Models (LLMs), most people fall into two categories. The first is alarmists. These people are concerned about the negative connotations of indiscriminate usage of AI, such as losing their jobs or military weapons for mass annihilation. The second category are deniers, and I must admit I was one of them. When Generative AI came out, I dismissed it as vendor propaganda, like Big Data, auto-generative BI tools, lakehouses, ML, and the like. But the more I learn and use Generative AI, the more credit I believe it deserves. Because LLMs are trained with human and programming languages, one natural case where they could be helpful are code copilots, which is the focus of this newsletter. Let’s give Generative AI some credit!

Text2SQL

I have to say that I was impressed with LLM. I used the excellent Ric Zhou’s Text2SQL sample as a starting point inside Visual Studio Code.

A screenshot of a chat Description automatically generated

The sample uses the Python streamlit framework to create a web app that submits natural questions to Azure OpenAI. I was amazed how simple the LLM input was. Given that it’s trained with many popular languages, including SQL, all you have to do is provide some context, database schema (generated in a simple format by a provided tool), and a few prompts:

[{'role': 'system', 'content': 'You are smart SQL expert who can do text to SQL, following is the Azure SQL database data model <database schema>},
{'role': 'user', 'content': 'What are the three best selling cities for the "AWC Logo Cap" product?'},
{'role': 'assistant', 'content': 'SELECT TOP 3 A.City, sum(SOD.LineTotal) AS TotalSales \\nFROM [SalesLT].[SalesOrderDe...BY A.City \\nORDER BY TotalSales DESC; \\n'},
{'role': 'user', 'content': natural question here'}]

Let’s drill into these prompts.

  1. The first prompt is an example of role prompting which provides context to the model for our intention to act as a SQL expert.
  2. The sample includes a tool that generates the database schema consisting of table and columns in the following format below. Notice that referential integrity constraints are not included (the model doesn’t need to know how the tables are related!)
  3. The next “few shot” prompt assumes the role of an end user who will ask a natural question, such as ‘What are the three best selling cities for the “AWC Logo Cap” product? This is followed by the assistant’s response who hints the model what the correct query should be.
  4. Then the request for new query follows.

You can find more Text2SQL implementation details in my post “LLM Adventures: Text2SQL”.

Text2DAX

As a Microsoft BI practitioner, the next natural stop was Text2DAX. But wait, we have a Microsoft Fabric Copilot already for this, right? Yes, but what happens when you click the magic button in PBI Desktop? You are greeted that you need to purchase F64 or larger capacity. It’s a shame that Microsoft has decided that AI should be a super-premium feature. Given this horrible predicament, what would an innovative developer strapped for cash do? Create their own copilot of course!

Building upon the previous sample, this is remarkably simple. First, I obtained the model schema using Analysis Services Data Management Views (DMVs). Second, I changed slightly the prompt, along these lines:

“As a DAX expert, examine a Power BI data model with the following table definitions where TableName lists the tables and ColumnName lists the columns, and provide DAX query that can answer user questions based on the data model, you will think step by step throughout and return DAX statement directly without any additional explanation.”

You can find more Text2DAX implementation details in my post “LLM Adventures: Text2DAX”.

In summary, it appears that LLM can effectively assist us in writing code. The emphasis is on assist because I view the LLM role as a second set of eyes. Hey, what do you think about this problem I’m trying to solve here? LLM doesn’t absolve us from doing our homework and learning the fundamentals, nor it can compensate for improper design. While LLM might not always generate the optimum code and might sometimes fabricate, it can definitely assist you in creating business calculations, generating test queries, and learning along the way.

BTW, you can use any of the publicly available LLM apps, such as Copilot, ChatGPT, Google Gemini or Perplexity (you don’t need the sample app I’ve demonstrated) for Text2SQL and Text2DAX and probably you will obtain similar results if you give it the right prompts. I took this approach because I was interested in automating the process for business users.


Teo Lachev
Prologika, LLC | Making Sense of Data
logo

]]>
0
Prologika - Teo Lachev <![CDATA[Generative AI: Excessive Reinforcement Learning]]> https://prologika.com/?p=9290 2024-08-28T20:52:16Z 2024-08-28T20:52:16Z Do you know that without human intervention the Generative AI responses would be too toxic after training the model from the garbage on Internet? Do you know that OpenAI hired Kenyan workers to make ChatGPT less toxic? To throw in some jargon, this is called RLHF (reinforcement learning with human feedback).

Not sure who Microsoft hired to cleanse copilots, but apparently they went overboard. Here is a Windows Copilot jerk-knee response that I got to a prompt asking when to open with 2 No Trump in contract bridge.

I see, Generative AI refuse to generate since “trump” is a taboo topic nowadays. AI dummy!

]]>
0
Prologika - Teo Lachev <![CDATA[Fabric Direct Lake: Memory Utilization with Interactive Operations]]> https://prologika.com/?p=9286 2024-08-15T21:39:32Z 2024-08-15T21:39:32Z As I mentioned in my Power BI and Fabric Capacities: Thinking Outside the Box, memory limits of Fabric capacities could be rather restrictive for large semantic models with imported data. One relatively new option to combat out-of-memory scenarios that deserves to be evaluated and added to the list if Fabric is in your future is semantic models configured for Direct Lake storage. The blog covers results of limited testing that I did comparing side by side the memory utilization of two identical semantic models with the first one configured to import data and the second to use Direct Lake storage. If you need a Direct Lake primer, Chris Webb has done a great job covering its essentials here and here. As a disclaimer, the emphasis is on limited as these results reflect my personal observations based on some isolated tests I’ve done lately. Your results may and probably will vary considerably.

Understanding the Tests

My starting hypothesis was that Direct Lake on-demand loading will utilize memory much more efficiently for interactive operations, such as Power BI report execution. This is a bonus to the fact that data Direct Lake models don’t require refresh. Eliminating refresh could save tremendous amount of memory to start with, even if you apply advanced techniques such as incremental refresh or hybrid tables to models with imported data. Therefore, the tests that follow focus on memory utilization with interactive operations.

To test my hypothesis, I imported the first three months for year 2016 of the NY yellow taxi Azure open dataset to a lakehouse backed up by a Fabric F2 capacity. The resulted in 34.5 million rows distributed across several Delta Parquet files. I limited the data to three months because F2 ran out of memory around the 50 million rows mark with the error “This operation was canceled because there wasn’t enough memory to finish running it. Either reduce the memory footprint of your dataset by doing things such as limiting the amount of imported data, or if using Power BI Premium, increase the memory of the Premium capacity where this dataset is hosted. More details: consumed memory 2851 MB, memory limit 2851 MB, database size before command execution 220 MB”

Descriptive enough and in line with the F2 memory limit of maximum 3 GB per semantic model. I used Power BI desktop to import all that data into a YellowTaxiImported semantic model, which I published to Power BI Service and configured for large storage format. Then, I created online a second YellowTaxiDirectLake semantic model configured for Direct Lake storage mapped directly to the data in the lakehouse. I went back to Power BI desktop to whip up a few analytical (aggregate) queries and a few detail-level queries. Finally, I ran a few tests using DAX Studio.

Analyzing Import Mode

Even after a capacity restart, the YellowTaxiImported model immediately reported 1.4 GB of memory. My conclusion was that that the primary focus of Power BI Premium on-demand loading that was introduced a while back was to speed the first query after the model was evicted from memory. Indeed, I saw that many segments were memory resident and many weren’t, but using queries to touch the non-resident column didn’t increase the memory footprint. The following table lists the query execution times with “Clear On Run” enabled in DAX Studio (to avoid skewing due to cached query data).

Naturally, as the queriers get more detailed, the slower they get because VertiPaq is a columnar database. However, the important observation is that the memory footprint remains constant. Please note that Fabric allocates additional memory to execute the queries, so the memory footprint should grow up as the report load increases.

Query Duration (ms)
//Analytical query 1

EVALUATE
ROW(
“SumtotalAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[totalAmount])),
“SumtollsAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[tollsAmount])),
“SumtipAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[tipAmount]))
)

124
//Analytical query 2

DEFINE
VAR __DS0Core =
SUMMARIZECOLUMNS(
‘nyc_yellowtaxi'[puMonth],
“SumtotalAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[totalAmount]))
)
VAR __DS0BodyLimited =
SAMPLE(3502, __DS0Core, ‘nyc_yellowtaxi'[puMonth], 1)

EVALUATE
__DS0BodyLimited
ORDER BY
‘nyc_yellowtaxi'[puMonth]

148
//Analytical query 3

DEFINE
VAR __SQDS0Core =
SUMMARIZECOLUMNS(
‘nyc_yellowtaxi'[doLocationId],
“AveragetripDistance”, CALCULATE(AVERAGE(‘nyc_yellowtaxi'[tripDistance]))
)

VAR __SQDS0BodyLimited =
TOPN(50, __SQDS0Core, [AveragetripDistance], 0)

VAR __DS0Core =
SUMMARIZECOLUMNS(
‘nyc_yellowtaxi'[doLocationId],
__SQDS0BodyLimited,
“AveragetripDistance”, CALCULATE(AVERAGE(‘nyc_yellowtaxi'[tripDistance]))
)

VAR __DS0PrimaryWindowed =
TOPN(1001, __DS0Core, [AveragetripDistance], 0, ‘nyc_yellowtaxi'[doLocationId], 1)

EVALUATE
__DS0PrimaryWindowed

ORDER BY
[AveragetripDistance] DESC, ‘nyc_yellowtaxi'[doLocationId]

114
//Analytical query 4

DEFINE
VAR __SQDS0Core =
SUMMARIZECOLUMNS(
‘nyc_yellowtaxi'[doLocationId],
“AveragetripDistance”, CALCULATE(AVERAGE(‘nyc_yellowtaxi'[tripDistance]))
)

VAR __SQDS0BodyLimited =
TOPN(50, __SQDS0Core, [AveragetripDistance], 0)

VAR __DS0Core =
SUMMARIZECOLUMNS(
‘nyc_yellowtaxi'[rateCodeId],
__SQDS0BodyLimited,
“AveragetipAmount”, CALCULATE(AVERAGE(‘nyc_yellowtaxi'[tipAmount]))
)

VAR __DS0BodyLimited =
SAMPLE(3502, __DS0Core, ‘nyc_yellowtaxi'[rateCodeId], 1)
EVALUATE
__DS0BodyLimited

ORDER BY
‘nyc_yellowtaxi'[rateCodeId]

80
//Detail query 1

DEFINE
VAR __DS0FilterTable =
FILTER(KEEPFILTERS(VALUES(‘nyc_yellowtaxi'[startLat])), ‘nyc_yellowtaxi'[startLat] <> 0)

VAR __DS0FilterTable2 =
FILTER(
KEEPFILTERS(VALUES(‘nyc_yellowtaxi'[tpepPickupDateTime])),
‘nyc_yellowtaxi'[tpepPickupDateTime] < DATE(2016, 1, 2)
)

VAR __DS0Core =
SUMMARIZECOLUMNS(
ROLLUPADDISSUBTOTAL(
ROLLUPGROUP(‘nyc_yellowtaxi'[startLat], ‘nyc_yellowtaxi'[startLon]), “IsGrandTotalRowTotal”
),
__DS0FilterTable,
__DS0FilterTable2,
“SumtotalAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[totalAmount])),
“SumtipAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[tipAmount])),
“SumtollsAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[tollsAmount]))
)

VAR __DS0PrimaryWindowed =
TOPN(
502,
__DS0Core,
[IsGrandTotalRowTotal],
0,
‘nyc_yellowtaxi'[startLat],
1,
‘nyc_yellowtaxi'[startLon],
1
)

EVALUATE
__DS0PrimaryWindowed

ORDER BY
[IsGrandTotalRowTotal] DESC, ‘nyc_yellowtaxi'[startLat], ‘nyc_yellowtaxi'[startLon]

844
//Detail query 2

DEFINE
VAR __DS0FilterTable =
FILTER(KEEPFILTERS(VALUES(‘nyc_yellowtaxi'[startLat])), ‘nyc_yellowtaxi'[startLat] <> 0)

VAR __DS0FilterTable2 =
FILTER(
KEEPFILTERS(VALUES(‘nyc_yellowtaxi'[tpepPickupDateTime])),
‘nyc_yellowtaxi'[tpepPickupDateTime] < DATE(2016, 1, 2)
)

VAR __DS0Core =
SUMMARIZECOLUMNS(
ROLLUPADDISSUBTOTAL(
ROLLUPGROUP(‘nyc_yellowtaxi'[startLat], ‘nyc_yellowtaxi'[startLon]), “IsGrandTotalRowTotal”
),
__DS0FilterTable,
__DS0FilterTable2,
“SumtotalAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[totalAmount])),
“SumtipAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[tipAmount])),
“SumtollsAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[tollsAmount]))
)

VAR __DS0PrimaryWindowed =
TOPN(
502,
__DS0Core,
[IsGrandTotalRowTotal],
0,
‘nyc_yellowtaxi'[startLat],
1,
‘nyc_yellowtaxi'[startLon],
1
)

EVALUATE
__DS0PrimaryWindowed

ORDER BY
[IsGrandTotalRowTotal] DESC, ‘nyc_yellowtaxi'[startLat], ‘nyc_yellowtaxi'[startLon]

860
//Detail query 3

DEFINE
VAR __DS0FilterTable =
FILTER(KEEPFILTERS(VALUES(‘nyc_yellowtaxi'[startLat])), ‘nyc_yellowtaxi'[startLat] <> 0)

VAR __DS0FilterTable2 =
FILTER(
KEEPFILTERS(VALUES(‘nyc_yellowtaxi'[tpepPickupDateTime])),
‘nyc_yellowtaxi'[tpepPickupDateTime] < DATE(2016, 1, 2)
)

VAR __DS0Core =
SUMMARIZECOLUMNS(
ROLLUPADDISSUBTOTAL(
ROLLUPGROUP(‘nyc_yellowtaxi'[startLat], ‘nyc_yellowtaxi'[startLon]), “IsGrandTotalRowTotal”
),
__DS0FilterTable,
__DS0FilterTable2,
“SumtotalAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[totalAmount])),
“SumtipAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[tipAmount])),
“SumtollsAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[tollsAmount]))
)

VAR __DS0PrimaryWindowed =
TOPN(
502,
__DS0Core,
[IsGrandTotalRowTotal],
0,
‘nyc_yellowtaxi'[startLat],
1,
‘nyc_yellowtaxi'[startLon],
1
)

EVALUATE
__DS0PrimaryWindowed

ORDER BY
[IsGrandTotalRowTotal] DESC, ‘nyc_yellowtaxi'[startLat], ‘nyc_yellowtaxi'[startLon]

1,213
//Detail query 4 (All Columns)

DEFINE
VAR __DS0FilterTable =
FILTER(KEEPFILTERS(VALUES(‘nyc_yellowtaxi'[startLat])), ‘nyc_yellowtaxi'[startLat] <> 0)

VAR __DS0FilterTable2 =
FILTER(
KEEPFILTERS(VALUES(‘nyc_yellowtaxi'[tpepPickupDateTime])),
‘nyc_yellowtaxi'[tpepPickupDateTime] < DATE(2016, 1, 2)
)

VAR __ValueFilterDM1 =
FILTER(
KEEPFILTERS(
SUMMARIZECOLUMNS(
‘nyc_yellowtaxi'[startLat],
‘nyc_yellowtaxi'[startLon],
‘nyc_yellowtaxi'[paymentType],
‘nyc_yellowtaxi'[vendorID],
‘nyc_yellowtaxi'[improvementSurcharge],
‘nyc_yellowtaxi'[doLocationId],
‘nyc_yellowtaxi'[puLocationId],
‘nyc_yellowtaxi'[storeAndFwdFlag],
‘nyc_yellowtaxi'[tpepDropoffDateTime],
‘nyc_yellowtaxi'[tpepPickupDateTime],
__DS0FilterTable,
__DS0FilterTable2,
“SumtotalAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[totalAmount])),
“SumtipAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[tipAmount])),
“SumtollsAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[tollsAmount])),
“SumpassengerCount”, CALCULATE(SUM(‘nyc_yellowtaxi'[passengerCount])),
“SumtripDistance”, CALCULATE(SUM(‘nyc_yellowtaxi'[tripDistance])),
“SumendLat”, CALCULATE(SUM(‘nyc_yellowtaxi'[endLat])),
“SumendLon”, CALCULATE(SUM(‘nyc_yellowtaxi'[endLon])),
“Sumextra”, CALCULATE(SUM(‘nyc_yellowtaxi'[extra])),
“SumfareAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[fareAmount])),
“SummtaTax”, CALCULATE(SUM(‘nyc_yellowtaxi'[mtaTax])),
“SumpuMonth”, CALCULATE(SUM(‘nyc_yellowtaxi'[puMonth])),
“SumpuYear”, CALCULATE(SUM(‘nyc_yellowtaxi'[puYear])),
“CountrateCodeId”, CALCULATE(COUNTA(‘nyc_yellowtaxi'[rateCodeId]))
)
),
[SumtollsAmount] > 0
)

VAR __DS0Core =
SUMMARIZECOLUMNS(
ROLLUPADDISSUBTOTAL(
ROLLUPGROUP(
‘nyc_yellowtaxi'[startLat],
‘nyc_yellowtaxi'[startLon],
‘nyc_yellowtaxi'[paymentType],
‘nyc_yellowtaxi'[vendorID],
‘nyc_yellowtaxi'[improvementSurcharge],
‘nyc_yellowtaxi'[doLocationId],
‘nyc_yellowtaxi'[puLocationId],
‘nyc_yellowtaxi'[storeAndFwdFlag],
‘nyc_yellowtaxi'[tpepDropoffDateTime],
‘nyc_yellowtaxi'[tpepPickupDateTime]
), “IsGrandTotalRowTotal”
),
__DS0FilterTable,
__DS0FilterTable2,
__ValueFilterDM1,
“SumtotalAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[totalAmount])),
“SumtipAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[tipAmount])),
“SumtollsAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[tollsAmount])),
“SumpassengerCount”, CALCULATE(SUM(‘nyc_yellowtaxi'[passengerCount])),
“SumtripDistance”, CALCULATE(SUM(‘nyc_yellowtaxi'[tripDistance])),
“SumendLat”, CALCULATE(SUM(‘nyc_yellowtaxi'[endLat])),
“SumendLon”, CALCULATE(SUM(‘nyc_yellowtaxi'[endLon])),
“Sumextra”, CALCULATE(SUM(‘nyc_yellowtaxi'[extra])),
“SumfareAmount”, CALCULATE(SUM(‘nyc_yellowtaxi'[fareAmount])),
“SummtaTax”, CALCULATE(SUM(‘nyc_yellowtaxi'[mtaTax])),
“SumpuMonth”, CALCULATE(SUM(‘nyc_yellowtaxi'[puMonth])),
“SumpuYear”, CALCULATE(SUM(‘nyc_yellowtaxi'[puYear])),
“CountrateCodeId”, CALCULATE(COUNTA(‘nyc_yellowtaxi'[rateCodeId]))
)

VAR __DS0PrimaryWindowed =
TOPN(
502,
__DS0Core,
[IsGrandTotalRowTotal],
0,
‘nyc_yellowtaxi'[startLat],
1,
‘nyc_yellowtaxi'[startLon],
1,
‘nyc_yellowtaxi'[paymentType],
1,
‘nyc_yellowtaxi'[vendorID],
1,
‘nyc_yellowtaxi'[improvementSurcharge],
1,
‘nyc_yellowtaxi'[doLocationId],
1,
‘nyc_yellowtaxi'[puLocationId],
1,
‘nyc_yellowtaxi'[storeAndFwdFlag],
1,
‘nyc_yellowtaxi'[tpepDropoffDateTime],
1,
‘nyc_yellowtaxi'[tpepPickupDateTime],
1
)

EVALUATE
__DS0PrimaryWindowed

ORDER BY
[IsGrandTotalRowTotal] DESC,
‘nyc_yellowtaxi'[startLat],
‘nyc_yellowtaxi'[startLon],
‘nyc_yellowtaxi'[paymentType],
‘nyc_yellowtaxi'[vendorID],
‘nyc_yellowtaxi'[improvementSurcharge],
‘nyc_yellowtaxi'[doLocationId],
‘nyc_yellowtaxi'[puLocationId],
‘nyc_yellowtaxi'[storeAndFwdFlag],
‘nyc_yellowtaxi'[tpepDropoffDateTime],
‘nyc_yellowtaxi'[tpepPickupDateTime]

4,240

Analyzing Direct Lake

After another restart, the resident memory footprint of the YellowTaxiDirectLake model was only 22.4 KB! Indeed, the $System.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS DMV showed that only system-generated RowNumber columns were memory resident.

For each query, I recorded two runs to understand how much time is spent in on-demand loading of columns into memory. The Import Mode column was added for convenience to compare the second run duration with the corresponding query duration from the Import Mode tests. Finally, the Model Resident Memory column records the memory footprint of the Direct Lake model.

Query First Run
(ms)
Second Run (ms) Import Mode

(ms)

Model Resident Memory (MB)
//Analytical query 1 79 75 124 14
//Analytical query 2 79 76 148 14.3
//Analytical query 3 382 133 114 68.1
//Analytical query 4 209 130 80 68.13
//Detail query 1 7,763 1,023 844 669.13
//Detail query 2 1,484 1,453 860 670.53
//Detail query 3 1,881 1,463 1,213 670.6
//Detail query 4 9,663 3,668 4,240 1,270

Conclusion

To sum up this long post, the following observations can be made:

  1. As expected, the more columns the query touch, the higher the memory footprint of Direct Lake. For example, the last query requested all the columns, and the resulting memory footprint was at a par with imported mode.
  2. It’s important to note that when Fabric is under memory pressure, such as when the report load increases, Direct Lake will start paging out columns with low temperature. The exact thresholds and rules are not documented but I’d expect the eviction mechanism to be much more granular and intelligent than evicting entire datasets with imported mode.
  3. The reason that I didn’t see Direct Lake paging out memory is because I was still left with plenty (1.27 GB consumed out of 3 GB). It doesn’t make sense evicting data if there is no memory pressure since memory is the fasted storage.
  4. You’ll pay a certain price the first time a column is loaded on demand with Direct Lake. The more columns, the longer the wait. Subsequent runs, however, will be much faster if the column is still mapped in memory.
  5. Some queries will execute faster in import mode and some will execute slower. Overall, queries touching memory-resident columns should be comparable.

Therefore, if Direct Lake is an option for you, it should be at the forefront of your efforts to combat out-of-memory errors with large datasets. On the downside, more than likely you’ll have to implement ETL processes to synchronize your data warehouse to a Fabric lakehouse, unless your data is in Fabric to start with, or you use Fabric database mirroring for the currently supported data sources (Azure SQL DB, Cosmos, and Snowflake). I’m not counting the data synchronization time as a downside because it could supersede the time you currently spend in model refresh.

computer memory with queries executing and Microsoft Fabric logo. Image 4 of 4

]]>
0