Posts

LLM Adventures: Text2DAX

In my previous post, I covered how large language models, such as ChatGPT, can be used to convert natural queries to SQL. Let’s now see how Text2DAX fairs. But wait, we have a Microsoft Fabric copilot already for this, right? Yes, but what happens when you click the magic button in PBI Desktop? You are greeted that you need to purchase F64 or larger capacity. It’s a shame that Microsoft has decided that AI should be a super premium feature. Given this horrible predicament, what would a mortal developer strapped for cash do? Create their own copilot of course!

A screenshot of a chat Description automatically generated

Building upon the previous sample, this is remarkably simple. As I mentioned in the first blog, one great LLM feature is the loose input. To make the schema generation simple, I obtained the table and column names from the Adventure Works semantic model (a *.pbix file opened in PBI Desktop) by running the following query in DAX Studio:

select Table_Name as TableName, Column_Name as ColumnName 
from $SYSTEM.DBSCHEMA_COLUMNS 
where column_olap_type = 'attribute' and table_name <>'$MySelectedMeasure' and table_name <> '$Parameter' and table_name <> '$DynamicMeasure' 
order by table_name

This gave me the model schema in a two-column format (I had to remove the $ in front of the table name which is a relic from OLAP days where dimension names were prefixed with $). Also, notice that I haven’t included model relationships yet.

TableName,ColumnName
"Customer","FirstName"
"Customer","HouseOwnerFlag"
…

Then, I used the following prompts:

As a DAX expert, examine a Power BI data model with the following table definitions where TableName lists the tables and ColumnName lists the columns, and provide DAX query that can answer user questions based on the data model, you will think step by step throughout and return DAX statement directly without any additional explanation.

<Table definition>
TableName,ColumnName
"Customer","FirstName"
"Customer","HouseOwnerFlag"
...

As with Text2SQL, I used a user role prompt to feed a natural question “What are the three best selling cities for the “AWC Logo Cap” product?” followed by an assistant role prompt containing the expected DAX query (noticed that I’ve made a mistake by not using a SUM function to summarize the SalesAmount column):

EVALUATE
TOPN (
3,
SUMMARIZECOLUMNS (
'DimGeography'[City],
"SalesAmount", CALCULATE('FactInternetSales'[SalesAmount], 'DimProduct'[EnglishProductName] = "AWC Logo Cap")
)
)
ORDER BY [SalesAmount] DESC

Then I started asking questions and my amazement grew as the complexity increased. Here are some samples:

What are the three best selling cities for the “AWC Logo Cap” product?

EVALUATE
TOPN (
3,
SUMMARIZECOLUMNS (
'Geography'[City],
"SalesAmount",
CALCULATE (
SUM('InternetSales'[SalesAmount]), -- notice that LLM fixed my syntax error
'Product'[ProductName] = "AWC Logo Cap"
)
),
[SalesAmount], DESC)

Which products sold the most?

EVALUATE
TOPN (
10,
SUMMARIZECOLUMNS (
'Product'[ProductName],
"TotalSalesAmount",
SUM('InternetSales'[SalesAmount])
),
[TotalSalesAmount], DESC)

What was YTD sales variance for 2012 compared to last year?
— Here LLM switch to a measure definition

VAR CurrentYearSales =
CALCULATE (
SUM('InternetSales'[SalesAmount]),
'Date'[CalendarYear] = 2012,
DATESYTD('Date'[Date])
)
VAR PreviousYearSales =
CALCULATE (
SUM('InternetSales'[SalesAmount]),
'Date'[CalendarYear] = 2011,
DATESYTD('Date'[Date])
)
VAR SalesVariance = CurrentYearSales - PreviousYearSales
RETURN SalesVariance

What was the customer churn in 2012?

Now, this is impressive! To start with, LLM had to discover what a “churn” is and to figure out that it needs a date context. At its first attempt, it created an incorrect measure because the Date dimension doesn’t relate to the Customer dimension.

VAR CustomersAtStartOfYear =
CALCULATE (
DISTINCTCOUNT('Customer'[CustomerKey]),
'Date'[CalendarYear] = 2011
)
VAR CustomersAtEndOfYear =
CALCULATE (
DISTINCTCOUNT('Customer'[CustomerKey]),
'Date'[CalendarYear] = 2012
)
VAR NewCustomers =
CALCULATE (
DISTINCTCOUNT('Customer'[CustomerKey]),
'Date'[CalendarYear] = 2012,
NOT EXISTS (
CALCULATETABLE (
'Customer',
'Date'[CalendarYear] = 2011
),
'Customer'[CustomerKey]
)
)
VAR ChurnedCustomers = CustomersAtStartOfYear - CustomersAtEndOfYear + NewCustomers
RETURN ChurnedCustomers

Relationships are important concept in Power BI semantic modeling, and they are both a blessing and a curse. Thanks to these explicit paths, the end user doesn’t need how to join tables. But they also introduce rigidity and are subject of various limitations, such as that Power BI doesn’t support diamond or closed loop relationships (interestingly, at some point Microsoft was planning to implement UI to prompt the user which path to take but this never materialized.

Let’s now tell the model about the relationships. I added the following after the Table Definitions section to enumerate a few of the model relationships.

<Relationships>
InternetSales[OrderDate]='Date'[Date]
InternetSales[ShipDate]='Date'[Date] (inactive)
InternetSales[CustomerKey]=Customer[CustomerKey]
InternetSales[ProductKey]=Product[ProductKey]

Now running the previous query, we get a much better outcome because the measure now evaluates over the InternetSales table. It would have blown my mind if it had used CustomerID (business key) instead of CustomerKey (surrogate key) 😊

VAR CustomersAtStartOfYear =
CALCULATE (
DISTINCTCOUNT('InternetSales'[CustomerKey]),
'Date'[CalendarYear] = 2011
)
VAR CustomersAtEndOfYear =
CALCULATE (
DISTINCTCOUNT('InternetSales'[CustomerKey]),
'Date'[CalendarYear] = 2012
)
VAR RetainedCustomers =
CALCULATE (
DISTINCTCOUNT('InternetSales'[CustomerKey]),
'Date'[CalendarYear] = 2012,
FILTER (
'Customer',
CALCULATE (
DISTINCTCOUNT('InternetSales'[CustomerKey]),
'Date'[CalendarYear] = 2011
) > 0
)
)
VAR ChurnedCustomers = CustomersAtStartOfYear - RetainedCustomers
RETURN ChurnedCustomers

Which customers churned in 2012?

–Correctly switching to a DAX query

EVALUATE
EXCEPT (
CALCULATETABLE (
VALUES('Customer'[CustomerKey]),
'InternetSales',
'Date'[CalendarYear] = 2011
),
CALCULATETABLE (
VALUES('Customer'[CustomerKey]),
'InternetSales',
'Date'[CalendarYear] = 2012
)
)

Which customers churned in 2012? Show me the customer name.

EVALUATE
SELECTCOLUMNS (
EXCEPT (
CALCULATETABLE (
VALUES('Customer'[CustomerKey]),
'InternetSales',
'Date'[CalendarYear] = 2011
),
CALCULATETABLE (
VALUES('Customer'[CustomerKey]),
'InternetSales',
'Date'[CalendarYear] = 2012
)
),
"CustomerName", LOOKUPVALUE('Customer'[FullName], 'Customer'[CustomerKey], [CustomerKey])
)

Which customers churned in 2012? Use ship date instead of order date
–Notice how LLM adds USERELATIONSHIP.

EVALUATE
SELECTCOLUMNS (
EXCEPT (
CALCULATETABLE (
VALUES('Customer'[CustomerKey]),
'InternetSales',
'Date',
'Date'[CalendarYear] = 2011,
USERELATIONSHIP('InternetSales'[ShipDate], 'Date'[Date])
),
CALCULATETABLE (
VALUES('Customer'[CustomerKey]),
'InternetSales',
'Date',
'Date'[CalendarYear] = 2012,
USERELATIONSHIP('InternetSales'[ShipDate], 'Date'[Date])
)
),
"CustomerName", LOOKUPVALUE('Customer'[FullName], 'Customer'[CustomerKey], [CustomerKey])
)

In summary, it appears that LLM can effectively assist us in writing code. The emphasis is on assist because I view the LLM role as a second set of eyes. Hey, what do you think about this problem I’m trying to solve here? LLM doesn’t absolve us from doing our homework and learning the fundamentals, nor it can compensate for improper design. While LLM might not always generate the optimum code and might sometimes fabricate, it can definitely assist you in creating business calculations, generating test queries, and learning along the way.

BTW, you can use any of the publicly available LLM apps, such as Copilot, ChatGPT, Google Gemini or Perplexity (you don’t need the sample app I’ve demonstrated) for Text2SQL and Text2DAX and probably you would obtain similar results if you give it the right prompts. I used this app because I was interested in automating the process.

LLM Adventures: Text2SQL

As inspired by Kevin Jordain, who clued me about the increased demand for using natural questions to analyze client’s data, John Kerski, who did a great presentation to our Atlanta BI Group on integrating Power Query with Azure OpenID, and Kyle Hale, who believes (wrongly 😊) that the Databricks Genie will make Power BI and semantic modeling obsolete, I set on a quest to find how effective Large Language Models (LLM) are in generating SQL from natural questions, also known as Text2SQL. What a better way to spend some free time around the 4th of July holiday, right?

I have to say that I was impressed with LLM. I used the excellent Ric Zhou’s Text2SQL sample as a starting point inside Visual Studio Code. The sample uses the Python streamlit framework to create a web app that submits natural questions to Azure OpenAI.

A screenshot of a chat Description automatically generated

My humble contributions were:

  1. I switched to the Azure OpenAI chatgpt-4o model.
  2. Because the Python openai module has deprecated ChatCompletion, I transitioned the code to use the new way of interacting, as I explained in the related GitHub issue.
  3. I used the AdventureWorksDW2019 database which I deployed to Azure SQL Database.

I was amazed how simple the LLM input was. Given that it’s trained with many popular languages, including SQL, all you have to do is provide some context, database schema (generated in a simple format by a provided tool), and a few prompts:

[{'role': 'system', 'content': 'You are smart SQL expert who can do text to SQL, following is the Azure SQL database data model <database schema>},
{'role': 'user', 'content': 'What are the three best selling cities for the "AWC Logo Cap" product?'},
{'role': 'assistant', 'content': 'SELECT TOP 3 A.City, sum(SOD.LineTotal) AS TotalSales \\nFROM [SalesLT].[SalesOrderDe...BY A.City \\nORDER BY TotalSales DESC; \\n'},
{'role': 'user', 'content': natural question here'}]

Let’s decipher this:

  1. The first prompt provides context to the model for our intention to act as a SQL expert.
  2. The sample includes a tool that generates the database schema consisting of table and columns in the following format below. Noticed that referential integrity constraints are not included (the model doesn’t need to know how the tables are related!) There is nothing special in the syntax. It is fluid (impressive!), which is a paradigm shift from structured XML or JSON where missing a bracket crashes the system. So, I could have expressed the schema using the following notation and it would still work (ideally, we should also provide the column data types):
    Table: FactSalesQuota
    – SalesQuotaKey
    – other columns

    or
    Table: FactSalesQuota
    Columns: column1, column2

    or
    CREATE TABLE (DML here)

<Table definition>
[dbo].[FactSalesQuota]
-SalesQuotaKey
-EmployeeKey
-DateKey
-CalendarYear
-CalendarQuarter
-SalesAmountQuota
-Date

[dbo].[DimAccount]
-AccountKey 
-…
  1. The next prompt assumes the role of an end user who will ask a natural question, such as ‘What are the three best selling cities for the “AWC Logo Cap” product?’
  2. The next prompt assumes the role of an assistant who hints the model what the correct query should be.
  3. The final prompt acts as a user who now provides what natural question you enter.

The sample also supports running the query against the database and visualizing the results as a table, bar chart, area chart, or line chart. Here are some results

  1. What are the three best selling cities for the “AWC Logo Cap” product?
    SELECT TOP 3 G.City, SUM(FIS.SalesAmount) AS TotalSales
    FROM [dbo].[FactInternetSales] FIS
    JOIN [dbo].[DimProduct] P ON FIS.ProductKey = P.ProductKey
    JOIN [dbo].[DimCustomer] C ON FIS.CustomerKey = C.CustomerKey
    JOIN [dbo].[DimGeography] G ON C.GeographyKey = G.GeographyKey
    WHERE P.EnglishProductName = 'AWC Logo Cap'
    GROUP BY G.City
    ORDER BY TotalSales DESC;
  1. Which products sold the most?
    SELECT P.EnglishProductName, SUM(FIS.SalesAmount) AS TotalSales
    FROM [dbo].[FactInternetSales] FIS
    JOIN [dbo].[DimProduct] P ON FIS.ProductKey = P.ProductKey
    GROUP BY P.EnglishProductName
    ORDER BY TotalSales DESC;
  1. Which customers bought something in 2012 but didn’t buy anything in 2011?
    SELECT DISTINCT C.CustomerKey, C.FirstName, C.LastName
    FROM [dbo].[FactInternetSales] FIS2012
    JOIN [dbo].[DimCustomer] C ON FIS2012.CustomerKey = C.CustomerKey
    WHERE FIS2012.OrderDateKey BETWEEN 20120101 AND 20121231
    AND C.CustomerKey NOT IN (
    SELECT DISTINCT FIS2011.CustomerKey
    FROM [dbo].[FactInternetSales] FIS2011
    WHERE FIS2011.OrderDateKey BETWEEN 20110101 AND 20111231);

All SQL statements were syntactically and logically correct. Pretty amazing, right? Especially considering that the model had to figure out how the tables related, such as that DimCustomer is related to FactInternetSales but not to FactResellerSales.

In summary, it appears that LLM can be used effectively for generating SQL from natural text. This is opening new possibilities for ad hoc data analytics and for assisting users with learning or generating code. I believe that simplified schema is very important, such as star schema. This underscores the importance of schema design for data analytics, something that AI can’t do (yet).

What about Text2DAX? Can LLM do it? You bet! DAX happens to be one of the programming languages used for training, so stay tuned for the next blog post where I’ll present you with a poor man’s DAX copilot.