LLM Adventures: Text2DAX

July 11, 2024/0 Comments/in Blog/by Prologika - Teo Lachev

In my previous post, I covered how large language models, such as ChatGPT, can be used to convert natural queries to SQL. Let’s now see how Text2DAX fairs. But wait, we have a Microsoft Fabric copilot already for this, right? Yes, but what happens when you click the magic button in PBI Desktop? You are greeted that you need to purchase F64 or larger capacity. It’s a shame that Microsoft has decided that AI should be a super premium feature. Given this horrible predicament, what would a mortal developer strapped for cash do? Create their own copilot of course!

Building upon the previous sample, this is remarkably simple. As I mentioned in the first blog, one great LLM feature is the loose input. To make the schema generation simple, I obtained the table and column names from the Adventure Works semantic model (a *.pbix file opened in PBI Desktop) by running the following query in DAX Studio:

select Table_Name as TableName, Column_Name as ColumnName 
from $SYSTEM.DBSCHEMA_COLUMNS 
where column_olap_type = 'attribute' and table_name <>'$MySelectedMeasure' and table_name <> '$Parameter' and table_name <> '$DynamicMeasure' 
order by table_name

This gave me the model schema in a two-column format (I had to remove the $ in front of the table name which is a relic from OLAP days where dimension names were prefixed with $). Also, notice that I haven’t included model relationships yet.

TableName,ColumnName
"Customer","FirstName"
"Customer","HouseOwnerFlag"
…

Then, I used the following prompts:

As a DAX expert, examine a Power BI data model with the following table definitions where TableName lists the tables and ColumnName lists the columns, and provide DAX query that can answer user questions based on the data model, you will think step by step throughout and return DAX statement directly without any additional explanation.

<Table definition>
TableName,ColumnName
"Customer","FirstName"
"Customer","HouseOwnerFlag"
...

As with Text2SQL, I used a user role prompt to feed a natural question “What are the three best selling cities for the “AWC Logo Cap” product?” followed by an assistant role prompt containing the expected DAX query (noticed that I’ve made a mistake by not using a SUM function to summarize the SalesAmount column):

EVALUATE
TOPN (
3,
SUMMARIZECOLUMNS (
'DimGeography'[City],
"SalesAmount", CALCULATE('FactInternetSales'[SalesAmount], 'DimProduct'[EnglishProductName] = "AWC Logo Cap")
)
)
ORDER BY [SalesAmount] DESC

Then I started asking questions and my amazement grew as the complexity increased. Here are some samples:

What are the three best selling cities for the “AWC Logo Cap” product?

EVALUATE
TOPN (
3,
SUMMARIZECOLUMNS (
'Geography'[City],
"SalesAmount",
CALCULATE (
SUM('InternetSales'[SalesAmount]), -- notice that LLM fixed my syntax error
'Product'[ProductName] = "AWC Logo Cap"
)
),
[SalesAmount], DESC)

Which products sold the most?

EVALUATE
TOPN (
10,
SUMMARIZECOLUMNS (
'Product'[ProductName],
"TotalSalesAmount",
SUM('InternetSales'[SalesAmount])
),
[TotalSalesAmount], DESC)

What was YTD sales variance for 2012 compared to last year?
— Here LLM switch to a measure definition

VAR CurrentYearSales =
CALCULATE (
SUM('InternetSales'[SalesAmount]),
'Date'[CalendarYear] = 2012,
DATESYTD('Date'[Date])
)
VAR PreviousYearSales =
CALCULATE (
SUM('InternetSales'[SalesAmount]),
'Date'[CalendarYear] = 2011,
DATESYTD('Date'[Date])
)
VAR SalesVariance = CurrentYearSales - PreviousYearSales
RETURN SalesVariance

What was the customer churn in 2012?

Now, this is impressive! To start with, LLM had to discover what a “churn” is and to figure out that it needs a date context. At its first attempt, it created an incorrect measure because the Date dimension doesn’t relate to the Customer dimension.

VAR CustomersAtStartOfYear =
CALCULATE (
DISTINCTCOUNT('Customer'[CustomerKey]),
'Date'[CalendarYear] = 2011
)
VAR CustomersAtEndOfYear =
CALCULATE (
DISTINCTCOUNT('Customer'[CustomerKey]),
'Date'[CalendarYear] = 2012
)
VAR NewCustomers =
CALCULATE (
DISTINCTCOUNT('Customer'[CustomerKey]),
'Date'[CalendarYear] = 2012,
NOT EXISTS (
CALCULATETABLE (
'Customer',
'Date'[CalendarYear] = 2011
),
'Customer'[CustomerKey]
)
)
VAR ChurnedCustomers = CustomersAtStartOfYear - CustomersAtEndOfYear + NewCustomers
RETURN ChurnedCustomers

Relationships are important concept in Power BI semantic modeling, and they are both a blessing and a curse. Thanks to these explicit paths, the end user doesn’t need how to join tables. But they also introduce rigidity and are subject of various limitations, such as that Power BI doesn’t support diamond or closed loop relationships (interestingly, at some point Microsoft was planning to implement UI to prompt the user which path to take but this never materialized.

Let’s now tell the model about the relationships. I added the following after the Table Definitions section to enumerate a few of the model relationships.

<Relationships>
InternetSales[OrderDate]='Date'[Date]
InternetSales[ShipDate]='Date'[Date] (inactive)
InternetSales[CustomerKey]=Customer[CustomerKey]
InternetSales[ProductKey]=Product[ProductKey]

Now running the previous query, we get a much better outcome because the measure now evaluates over the InternetSales table. It would have blown my mind if it had used CustomerID (business key) instead of CustomerKey (surrogate key) 😊

VAR CustomersAtStartOfYear =
CALCULATE (
DISTINCTCOUNT('InternetSales'[CustomerKey]),
'Date'[CalendarYear] = 2011
)
VAR CustomersAtEndOfYear =
CALCULATE (
DISTINCTCOUNT('InternetSales'[CustomerKey]),
'Date'[CalendarYear] = 2012
)
VAR RetainedCustomers =
CALCULATE (
DISTINCTCOUNT('InternetSales'[CustomerKey]),
'Date'[CalendarYear] = 2012,
FILTER (
'Customer',
CALCULATE (
DISTINCTCOUNT('InternetSales'[CustomerKey]),
'Date'[CalendarYear] = 2011
) > 0
)
)
VAR ChurnedCustomers = CustomersAtStartOfYear - RetainedCustomers
RETURN ChurnedCustomers

Which customers churned in 2012?

–Correctly switching to a DAX query

EVALUATE
EXCEPT (
CALCULATETABLE (
VALUES('Customer'[CustomerKey]),
'InternetSales',
'Date'[CalendarYear] = 2011
),
CALCULATETABLE (
VALUES('Customer'[CustomerKey]),
'InternetSales',
'Date'[CalendarYear] = 2012
)
)

Which customers churned in 2012? Show me the customer name.

EVALUATE
SELECTCOLUMNS (
EXCEPT (
CALCULATETABLE (
VALUES('Customer'[CustomerKey]),
'InternetSales',
'Date'[CalendarYear] = 2011
),
CALCULATETABLE (
VALUES('Customer'[CustomerKey]),
'InternetSales',
'Date'[CalendarYear] = 2012
)
),
"CustomerName", LOOKUPVALUE('Customer'[FullName], 'Customer'[CustomerKey], [CustomerKey])
)

Which customers churned in 2012? Use ship date instead of order date
–Notice how LLM adds USERELATIONSHIP.

EVALUATE
SELECTCOLUMNS (
EXCEPT (
CALCULATETABLE (
VALUES('Customer'[CustomerKey]),
'InternetSales',
'Date',
'Date'[CalendarYear] = 2011,
USERELATIONSHIP('InternetSales'[ShipDate], 'Date'[Date])
),
CALCULATETABLE (
VALUES('Customer'[CustomerKey]),
'InternetSales',
'Date',
'Date'[CalendarYear] = 2012,
USERELATIONSHIP('InternetSales'[ShipDate], 'Date'[Date])
)
),
"CustomerName", LOOKUPVALUE('Customer'[FullName], 'Customer'[CustomerKey], [CustomerKey])
)

In summary, it appears that LLM can effectively assist us in writing code. The emphasis is on assist because I view the LLM role as a second set of eyes. Hey, what do you think about this problem I’m trying to solve here? LLM doesn’t absolve us from doing our homework and learning the fundamentals, nor it can compensate for improper design. While LLM might not always generate the optimum code and might sometimes fabricate, it can definitely assist you in creating business calculations, generating test queries, and learning along the way.

BTW, you can use any of the publicly available LLM apps, such as Copilot, ChatGPT, Google Gemini or Perplexity (you don’t need the sample app I’ve demonstrated) for Text2SQL and Text2DAX and probably you would obtain similar results if you give it the right prompts. I used this app because I was interested in automating the process.

LLM Adventures: Text2SQL

July 7, 2024/0 Comments/in Blog/by Prologika - Teo Lachev

As inspired by Kevin Jordain, who clued me about the increased demand for using natural questions to analyze client’s data, John Kerski, who did a great presentation to our Atlanta BI Group on integrating Power Query with Azure OpenID, and Kyle Hale, who believes (wrongly 😊) that the Databricks Genie will make Power BI and semantic modeling obsolete, I set on a quest to find how effective Large Language Models (LLM) are in generating SQL from natural questions, also known as Text2SQL. What a better way to spend some free time around the 4^th of July holiday, right?

I have to say that I was impressed with LLM. I used the excellent Ric Zhou’s Text2SQL sample as a starting point inside Visual Studio Code. The sample uses the Python streamlit framework to create a web app that submits natural questions to Azure OpenAI.

My humble contributions were:

I switched to the Azure OpenAI chatgpt-4o model.
Because the Python openai module has deprecated ChatCompletion, I transitioned the code to use the new way of interacting, as I explained in the related GitHub issue.
I used the AdventureWorksDW2019 database which I deployed to Azure SQL Database.

I was amazed how simple the LLM input was. Given that it’s trained with many popular languages, including SQL, all you have to do is provide some context, database schema (generated in a simple format by a provided tool), and a few prompts:

[{'role': 'system', 'content': 'You are smart SQL expert who can do text to SQL, following is the Azure SQL database data model <database schema>},
{'role': 'user', 'content': 'What are the three best selling cities for the "AWC Logo Cap" product?'},
{'role': 'assistant', 'content': 'SELECT TOP 3 A.City, sum(SOD.LineTotal) AS TotalSales \\nFROM [SalesLT].[SalesOrderDe...BY A.City \\nORDER BY TotalSales DESC; \\n'},
{'role': 'user', 'content': natural question here'}]

Let’s decipher this:

The first prompt provides context to the model for our intention to act as a SQL expert.
The sample includes a tool that generates the database schema consisting of table and columns in the following format below. Noticed that referential integrity constraints are not included (the model doesn’t need to know how the tables are related!) There is nothing special in the syntax. It is fluid (impressive!), which is a paradigm shift from structured XML or JSON where missing a bracket crashes the system. So, I could have expressed the schema using the following notation and it would still work (ideally, we should also provide the column data types):
Table: FactSalesQuota
– SalesQuotaKey
– other columns
or
Table: FactSalesQuota
Columns: column1, column2
or
CREATE TABLE (DML here)

<Table definition>
[dbo].[FactSalesQuota]
-SalesQuotaKey
-EmployeeKey
-DateKey
-CalendarYear
-CalendarQuarter
-SalesAmountQuota
-Date

[dbo].[DimAccount]
-AccountKey 
-…

The next prompt assumes the role of an end user who will ask a natural question, such as ‘What are the three best selling cities for the “AWC Logo Cap” product?’
The next prompt assumes the role of an assistant who hints the model what the correct query should be.
The final prompt acts as a user who now provides what natural question you enter.

The sample also supports running the query against the database and visualizing the results as a table, bar chart, area chart, or line chart. Here are some results

What are the three best selling cities for the “AWC Logo Cap” product?

SELECT TOP 3 G.City, SUM(FIS.SalesAmount) AS TotalSales
FROM [dbo].[FactInternetSales] FIS
JOIN [dbo].[DimProduct] P ON FIS.ProductKey = P.ProductKey
JOIN [dbo].[DimCustomer] C ON FIS.CustomerKey = C.CustomerKey
JOIN [dbo].[DimGeography] G ON C.GeographyKey = G.GeographyKey
WHERE P.EnglishProductName = 'AWC Logo Cap'
GROUP BY G.City
ORDER BY TotalSales DESC;

Which products sold the most?

SELECT P.EnglishProductName, SUM(FIS.SalesAmount) AS TotalSales
FROM [dbo].[FactInternetSales] FIS
JOIN [dbo].[DimProduct] P ON FIS.ProductKey = P.ProductKey
GROUP BY P.EnglishProductName
ORDER BY TotalSales DESC;

Which customers bought something in 2012 but didn’t buy anything in 2011?

SELECT DISTINCT C.CustomerKey, C.FirstName, C.LastName
FROM [dbo].[FactInternetSales] FIS2012
JOIN [dbo].[DimCustomer] C ON FIS2012.CustomerKey = C.CustomerKey
WHERE FIS2012.OrderDateKey BETWEEN 20120101 AND 20121231
AND C.CustomerKey NOT IN (
SELECT DISTINCT FIS2011.CustomerKey
FROM [dbo].[FactInternetSales] FIS2011
WHERE FIS2011.OrderDateKey BETWEEN 20110101 AND 20111231);

All SQL statements were syntactically and logically correct. Pretty amazing, right? Especially considering that the model had to figure out how the tables related, such as that DimCustomer is related to FactInternetSales but not to FactResellerSales.

In summary, it appears that LLM can be used effectively for generating SQL from natural text. This is opening new possibilities for ad hoc data analytics and for assisting users with learning or generating code. I believe that simplified schema is very important, such as star schema. This underscores the importance of schema design for data analytics, something that AI can’t do (yet).

What about Text2DAX? Can LLM do it? You bet! DAX happens to be one of the programming languages used for training, so stay tuned for the next blog post where I’ll present you with a poor man’s DAX copilot.

Atlanta Microsoft BI Group Meeting on July 1st (Commenting Power Query with Azure OpenAI)

June 25, 2024/0 Comments/in Blog, Events/by Prologika - Teo Lachev

Atlanta BI fans, please join us in person for the next meeting on Monday, July 1st at 6:30 PM ET. John Kerski (Microsoft MVP) will shows us how to integrate ChatGPT with Power BI. Your humble correspondent will help you catch up on Microsoft BI latest. CloudStaff.ai will sponsor the event. For more details and sign up, visit our group page.

Details

Presentation: Commenting Power Query with Azure OpenAI
Delivery: In-person
Date: July 1, 2024
Time: 18:30 – 20:30 ET
Level: Intermediate
Food: Pizza and drinks will be provided

Agenda:
18:15-18:30 Registration and networking
18:30-19:00 Organizer and sponsor time (events, Power BI latest, sponsor marketing)
19:00-20:15 Main presentation
20:15-20:30 Q&A

Venue
Improving Office
11675 Rainwater Dr
Suite #100
Alpharetta, GA 30009

Overview: Large Language Models (such as ChatGPT) can greatly enhance the way you develop and deliver Power BI solutions. In this session I will show you how to integrate Azure Open AI into Power BI using prompt engineering techniques.

Speaker: John Kerski has over a decade of experience in technical and government leadership. He specializes in managing Data Analytics projects and implementing DataOps principles to enhance solution delivery and minimize errors. John’s expertise is showcased through his ability to offer patterns and templates that streamline the adoption of DataOps with Microsoft Fabric and Power BI. His in-depth knowledge and hands-on approach provide clients with practical tools to achieve efficient and effective data operations.

Sponsor: CloudStaff.ai

Fabric Analytics Engineer Certification Snagged

June 7, 2024/0 Comments/in Blog, News/by Prologika - Teo Lachev

Just passed DP-600 and therefore I’m officially a Fabric Analytics Engineer. Besides renewals, I intend this to be my last Microsoft certification. After more than 30 years of certifying (12 certifications and 30 exams) and helping other certify with training and books, I think it’s time to close this chapter. Call me certifiable…

Modern Data Warehouse (MDM) Reloaded

June 1, 2024/0 Comments/in Blog/by Prologika - Teo Lachev

“Where are the prophets, where are the visionaries, where are the poets
To breach the dawn of the sentimental mercenary”
“Fugazi”, Marillion

I’ve written in the past about the dangers of blindly following “modern” data architectures (see my posts “Are you modern yet?” and “Data Lakehouse: The Good, the Bad, and the Ugly”) but here we go again. Once upon a time, a large company hired a large Microsoft partner to solve a common and pervasive challenge. Left on their own, departments have set up their own data servers, thus creating data silos leading to data duplication and inconsistent results.

How did the large vendor attempt to solve these horrible issues? Modern Data Warehouse (MDM) architecture of course. Nothing wrong with it except that EDW and organizational semantic model(s) are missing and that most of the effort went into implementing the data lake medallion architecture where all the incoming data ended up staged as Parquet files. It didn’t matter that 99% of the data came from relational databases. Further, to solve a data change tracking requirement, the vendor decided to create a new file each time ETL runs. So even if nothing has changed in the source feed, the data is duplicated should one day the user wants to go back in time and see what the data looked like then. There are of course better ways to handle this that doesn’t even require ETL, such as SQL Server temporal tables, but I digress.

At least some cool heads prevailed and the Silver layer got implemented as a relational ODS to serve the needs of home-grown applications, so the apps didn’t have to deal with files. What about EDW and organizational semantic models? Not there because the project ran out of budget and time. I bet if that vendor got hired today, they would have gone straight for Fabric Lakehouse and Fabric premium pricing (nowadays Microsoft treats partners as an extension to its salesforce and requires them to meet certain revenue targets as I explain in “Dissolving Partnerships“), which alone would have produced the same outcome.

What did the vendor accomplish? Not much. Nor only didn’t the implementation address the main challenges, but it introduced new, such as overcomplicated ETL and redundant data staging. Although there might be good reasons for file staging (see the second blog above), in most cases I consider it a lunacy to stage perfect relational data to files, along the way losing metadata, complicating ETL, ending up serverless, and then reloading the same data into a relational database (ODS in this case).

I’ve heard that the vendor justified the lake effort by empowering data scientists to do ML one day. I’d argue that if that day ever comes, the likelihood (pun not intended) of data scientists working directly on the source schema would be infinitely small since more than likely they would require the input datasets to be shaped in a different way which would probably require another ETL pipeline altogether.

I don’t subject my clients to excessive file staging. My file staging litmus test is what’s the source data format. If I can connect to a server and get in a tabular (relational) format, I stage it directly to a relational database (ODS or DW). However, if it’s provided as files (downloaded or pushed, reference data, or unstructured data), then obviously there is no other way. That’s why we have lakes.

Fast forward a few years, and your humble correspondent got hired to assess the damage and come up with remediation. Data lakes won’t do it. Lakehouses and Delta Parquet (a poor attempt to recreate and replace relational databases) won’t do it. Fabric won’t do it and it’s too bad that Microsoft pushes Lakehouse while the main focus should be Fabric Data Warehouse, which unfortunately is not ready for prime time (but we have plenty of other options).

What will do it? Going back to the basics and embracing the “Discipline at the core, flexibility at edge” ideology (kudos to Microsoft for publishing their lessons learned). From a technology standpoint, the critical pieces are EDW and organizational semantic models. If you don’t have these, I’m sorry but you are not modern yet. In fact, you aren’t even classic, considering that they have been around for long, long time.

So, keep on implementing these medallion lakes to keep my consulting pipeline full. Drop a comment how they work for you.

Atlanta Microsoft BI Group Meeting on June 3rd (Power BI Direct Lake storage mode)

May 28, 2024/0 Comments/in Blog, Events/by Prologika - Teo Lachev

Atlanta BI fans, please join us in person for the next meeting on Monday, June 3rd at 6:30 PM ET. Shabnam Watson (Consultant and Owner of ABI Cube) will discuss the benefits of using the Direct Lake storage mode in Microsoft Fabric. Your humble correspondent will help you catch up on Microsoft BI latest. CloudStaff.ai will sponsor the event. For more details and sign up, visit our group page.

Presentation: Power BI Direct Lake storage mode: How to achieve blazing fast performance without importing data
Delivery: In-person
Time: 18:30 – 20:30 ET
Level: Beginner/Intermediate
Food: Pizza and drinks will be provided

Agenda:
18:15-18:30 Registration and networking
18:30-19:00 Organizer and sponsor time (events, Power BI latest, sponsor marketing)
19:00-20:15 Main presentation
20:15-20:30 Q&A

Venue
Improving Office
11675 Rainwater Dr
Suite #100
Alpharetta, GA 30009

Overview: Power BI engine in Microsoft Fabric has been significantly revamped to work directly with Delta files in OneLake. This brand-new storage mode is called Direct Lake which allows Power BI to achieve super-fast query performance on billion row datasets without having to import the data into Power BI. Join this session to learn how you can work with Direct Lake with just a few clicks.

Speaker: Shabnam is a business intelligence consultant and owner of ABI Cube, a company that specializes in delivering data solutions using the Microsoft Data Platform. She has over 20 years of experience and is recognized as a Microsoft Data Platform MVP for her technical excellence and community involvement. She is passionate about helping organizations harness the power of data to drive insights and innovation. She has a deep expertise in Microsoft Analysis Services, Power BI, Azure Synapse Analytics, and Microsoft Fabric. She is also a speaker, blogger, and organizer for SQL Saturday Atlanta – BI version, where she shares her knowledge and best practices with the data community.

Sponsor: CloudStaff.ai

Tips for Configuring Azure Data Factory Alerts

April 15, 2024/0 Comments/in Blog/by Prologika - Teo Lachev

Alerting is an important monitoring task for any ETL process. Azure Data Factory can integrate with a generic Azure event framework (Azure Monitor) which makes it somewhat unintuitive for ETL monitoring. You can set up and change the alerts using the ADF Monitoring hub.

You might find the following tips useful for setting up an ADF alert:

Dimension – I like to nest pipelines where a master pipeline starts dimension and fact master pipelines, which in turn execute unit pipelines, such as a loading a dimension table. This allows me to restart ETL at any level. In the Dimension dropdown (nothing to do with DW dimensions), I select all such master pipelines because if an error is triggered downstream, it will bubble up to the master.
Failure Type – I select all failure types.
Condition – I configure the alert to be generated as soon as one failure occurs.
Period – This is a tricky one. It specifies over what period the threshold failure count will be evaluated. Let’s say you schedule the ETL process nightly at 12 AM. If you specify a value lower than 24 hours, let’s say 6 hours, Azure Monitor will check for failures every 6 hours. If ETL fails at say 12:15 AM, you will get the alert the next time Azure Monitor runs as you should. But then, after it checks in another 6 hours and finds no failure, it will send a notification that the failure is resolved, while probably it’s not. Therefore, I set the period to 24 hours. Obviously, this will be an issue if ETL runs weekly but 24 hours is the longest period supported.
Frequency – I set this to the longest frequency of 1 hour as it doesn’t make sense to check frequently for failures if the process runs once in a day. With this setup, the threshold count will be evaluated every hour for each 24-hour cycle.

Atlanta Microsoft BI Group Meeting on April 1st (Real-Time Analytics with Microsoft Fabric: Unlocking the Power of Streaming Data)

March 26, 2024/0 Comments/in Blog, Events/by Prologika - Teo Lachev

Atlanta BI fans, please join us in person for the next meeting on Monday, April 1st at 6:30 PM ET. Aravinth Krishnasamy (Principal Architect at Ecolab) will provide an end-to-end overview of Microsoft Fabric real-time analytics capabilities. Your humble correspondent will help you catch up on Microsoft BI latest. CloudStaff.ai will sponsor the event. For more details and sign up, visit our group page.

Presentation: Real-Time Analytics with Microsoft Fabric: Unlocking the Power of Streaming Data

Delivery: In-person

Date: April 1, 2024

Time: 18:30 – 20:30 ET

Level: Beginner/Intermediate

Food: Pizza and drinks

Agenda:

18:15-18:30 Registration and networking

18:30-19:00 Organizer and sponsor time (events, Microsoft BI latest, sponsor marketing)

19:00-20:15 Main presentation

20:15-20:30 Q&A

Overview: This session will provide an end-to-end overview of Microsoft Fabric Real-Time Analytics capabilities. We will go over the following topics:
1. Introduction to Real-Time Analytics: Overview of the platform and its capabilities
2. Data Ingestion: How to ingest data from various streaming sources into Fabric
3. Data Analysis & Visualization: How to analyze and visualize data using Real-Time Analytics and Power BI
4. Use Cases: Real-world use cases for Real-Time Analytics.

Speaker: Aravinth Krishnasamy is a Principal Architect at Ecolab, where he focuses on business intelligence, data warehousing and advanced analytics applications. Aravinth holds numerous technical certifications and has over 18 years of IT experience.

Sponsor: CloudStaff.ai

What Can Fabric Do For My Lake?

March 13, 2024/0 Comments/in Blog/by Prologika - Teo Lachev

Previously, I discussed the pros and cons of Microsoft Fabric OneLake and Lakehouse. But what if you have a data lake already? Will Fabric add any value, especially if your organization is on Power BI Premium and you get Fabric features for free (that is, assuming you are not overloading your capacity resources)? Well, it depends.

Managed Area

A Fabric lakehouse defines two areas: managed and unmanaged. The managed area (Tables folder) is exclusively for Delta/Parquet tables. If you have your own data lake with Delta/Parquet files, such as Databricks delta lake, you can create shortcuts to these files or folders located in ADLS Gen 2 or Amazon S3. Consequently, the Fabric lakehouse would automatically register these shortcuts as tables.

Life is good in the managed area. Shortcuts to Delta/Parquet tables open interesting possibilities for data virtualization, such as:

Your users can use the Lakehouse SQL Analytics endpoint to join tables using SQL. This is useful for ad-hoc analysis. Joins could also be useful so users can shape the data they need before importing it in Power BI Desktop as opposed to connecting to individual files and using Power Query to join the tables. Not only could this reduce the size of the ingested data, but it could also improve refresh performance.
Users can decide not to import the data at all but build semantic models in Direct Lake mode. This could be very useful to reduce latency or avoid caching large volumes of data.

Unmanaged Area

Very few organizations would have lakes with Delta Parquet files. Most data lakes contain heterogeneous files, such as text, Excel, or regular Parquet files. While a Fabric lakehouse can create shortcuts to any file, non Delta/Parquet shortcuts will go to the unmanaged area (Files folder).

Life is miserable in the unmanaged area. None of the cool stuff you see in demos happens here because the analytical endpoint and direct lake modes are not available. A weak case can still be made for data virtualization that shortcuts bring data readily available to where business users collaborate in Power BI: the Power BI workspace.

But what can the user do with these unmanaged shortcuts? Not much really. Power BI Desktop doesn’t even expose them when you connect to the lakehouse. Power BI dataflows Gen2 do give the user access to the Files folder so potentially users can create dataflows and transform data from these files.

Of course, the tradeoff here is that you are adding dependencies to OneLake which could be a problem should one day you decide to part ways. Another issue could be that you are layering Power BI security on top of your data lake security.

Oh yes, users can also load Parquet and CSV files to Delta tables by right-clicking a folder or a file in the Unmanaged area, and then selecting Load to Tables (New or Existing). Unfortunately, as it stands, this is a manual process that must be repeated when the source data changes.

Imagined Unmanaged Data Virtualization

This brings me to the two things that I believe Microsoft can do to greatly increase the value proposition of “unmanaged” data virtualization:

Extend load to table to the most popular file formats, such as JSON, XML, and Excel. Or, at least the ones that Polybase has been supporting for years. Not sure why we have to obsess with Delta Parquet and nothing else if Microsoft is serious about data virtualization.
Implement automatic synchronization to update the corresponding Delta table when the source file changes.

If these features are added, throwing Fabric to the mix could become more appealing.

In summary, Microsoft Fabric has embraced Delta Parquet as its native storage file format and has added various features that targets it. Unfortunately none of these features extend to other file formats. You must evaluate pros and cons when adopting Fabric with existing data lakes. As it stands, Fabric probably wouldn’t add much business value for data virtualization over file formats other than Delta Paquet files. As Fabric matures, new scenarios might be feasible to justify Fabric integration and dependency.

Make a Pledge and Trouble is at Hand: Microsoft AI and Fabric Copilots

March 11, 2024/0 Comments/in Blog/by Prologika - Teo Lachev

Rounding up the wisdom from the inscriptions of the Apollo Temple (see the Nothing in Excess and Know Thyself posts), the last known famous inscription was “Make a pledge and trouble is at hand.” Although the original intention was lost in time, it’s believed that this inscription instructs us to understand our limits so we don’t overpromise.

Fast forward to modern data analytics, I remember how Microsoft Power BI started less than 10 years ago with the grand promise to democratize BI to the masses. One would believe that AI would be an essential ingredient to fulfill that promise. To its credit, Power BI includes various built-in and useful ML features, such as Analyze Increase/Decrease (my favorite), Decomposition Tree, Key Influencers, Get Insights, and Q&A.

It’s surprising to me though that Fabric copilots are an ultra premium feature as they require at least P1 or its Fabric equivalent of F64 capacity with a price tag of $8,600 per month. As such, they will be an option only for large organizations that can afford this budget. Apparently, Microsoft sees copilots as the cherry on top of the Fabric pie that will commit undecided customers to premium spending. What happened to democratization and increased productivity? Domains of the rich?

I can’t try the Fabric copilots because they aren’t available even with Fabric Trial. But judging from the demos Microsoft provided, I’m not that impressed with them anyway. They all seem poor code generators on top of the Microsoft OpenAI service.

I remember how Q&A and natural interfaces were supposed to revolutionize data analytics. Now I don’t know of anyone using Q&A, but it can certainly produce exciting demos. As another footnote to Gartner-inspired grand visions, Fabric copilots will probably follow the same fate with a faster acceleration because of their hefty price tag.