A First Look at Microsoft Fabric: Recap

Did I disappoint you?
Or leave a bad taste in your mouth?
You act like you never had love
And you want me to go without

U2

In previous posts, I shared my initial impression of the recently announced Microsoft Fabric and its main engines. Now that we have the Fabric licensing and pricing, I’m ready to wrap up my review with a few parting notes. Here is how I plan to position Fabric to my clients:

Enterprise clients

These clients have complex data integration needs. More than likely, they are already on a Power BI Premium contract and highly-discounted pricing model that is reviewed and renewed annually with Microsoft. Given that Fabric can be enabled on premium capacities, you should definitely consider it selectively when it makes sense. For now, I believe a good case can be made for data lake and lakehouse if that’s your thing.

Now you have an alternative to Databricks and you can standardize BI on one platform and vendor.

I don’t have experience in Databricks to offer more in-depth comparison, but in my opinion the most compelling features to favor Fabric for now are:

  1. No additional cost or Power BI Premium capacity upgrade if you aren’t reaching the workload limits
  2. One platform and one vendor to avoid the blame game when things don’t work
  3. Fast Direct Lake data access for ad-hoc analysis directly on top of files in the lakehouse
  4. Easy data virtualization

If you decide in Fabric’s favor, you’d be wise to reduce dependencies on Microsoft proprietary and bundled features, such as Power Query dataflows and data pipelines insides Fabric (I’d use a stand-alone ADF instance once ADF supports Fabric). Hopefully, bring-your-own-lake will appear on day to circumvent the Fabric OneLake shortcomings.

Small and medium-size clients

Unfortunately, Microsoft didn’t make Fabric available with PPU (premium-per-user) licensing. This would surely put it out of reach for smaller organizations. True, you can purchase a Fabric F2 license for as little as $262/month and run it on a quarter of a core. I didn’t know a quarter of a core existed, but Microsoft did it, although you probably won’t get too far with it for production use (see results from my F2 limited performance tests here). You can opt for a higher SKU, but it would increase your bill and Fabric capacities can’t be auto-paused. For example, a “luxurious” 1 core (F8 plan) will put you in the 1K/month range, plus Power BI Pro licenses for all users (contributors and viewers).

But fear not. There is nothing in Fabric that you desperately need or can’t obtain outside Fabric in a much more cost-effective way.

Expect Microsoft to push Fabric aggressively. However, I believe Fabric has more appeal for large organizations while low-budged simple solutions with Power BI Pro or PPU licensing would likely better address your needs. And your BI solution is still going to be “modern”, whatever that means…

Fabric Semantic Modeling: The Good, the Bad, and the Ugly

In retrospect, I’d say I owe 50% of my BI career to Analysis Services and its flavors: Multidimensional, Tabular, and later Power BI. This is why I closely follow how this technology evolves. Fast forwarding to Fabric, there are no dramatic changes. Unlike the other two Fabric Engines (Lakehouse and Warehouse), Power BI datasets haven’t embraced the delta lake file format to store its data yet. The most significant change is the introduction of a new Direct Lake data access mode alongside the existing Import and DirectQuery.

The Good

Direct Lake will surely enable interesting scenarios, such as real-time BI on top of streaming data. It requires Parquet delta lake files and therefore it’s available only when connecting to the Lakehouse managed area (Tables folder) and Warehouse tables. Given that Parquet is a columnar format, which is what Tabular VertiPaq is, basically Microsoft has changed the engine to read Parquet files as it does with its proprietary IDF file format.

The primary usage scenario is fast analysis on top of large data volumes without importing data and without delegating the query to another server. Therefore, think of Direct Lake is a hybrid between Import and DirectQuery modes. By “large data volumes”, I mean data that otherwise won’t fit into memory and/or it will require substantial time to refresh but low latency access would be preferable.

Microsoft has accomplished this feat by using the following existing and new Analysis Services features:

  • Vertiscan – The ability for Analysis Services to query columnar storage. Instead of using the IDF file format to store the Vertipaq data, DirectLake instead uses the Parquet file format in Lakehouse or Warehouse. The AS engine loads the data from the Parquet files (with some extra effort) and maps the column values into (mostly) the same data structures that would have been used if the data was coming from IDF files. After that, Vertiscan is querying the data as if it was Import data, so query performance should be at par with Import mode.
  • On-demand data loading – The ability to page in and out data that was introduced in 2021 for imported data. If the data needs to be paged in, there will be some delay but after that it will be fast until and unless it gets paged out later on. Chris Webb covers on-demand loading in his post On-Demand Loading Of Direct Lake Power BI Datasets In Fabric.
  • V-order – an extension to the Parquet file format to get a better compression like VertiPaq

The Bad

Naturally, I’d like to see Direct Lake available outside Fabric.

Currently, here is what needs to happen to connect to external Delta Parquet files, such as files located in ADLS:

  • Create a lakehouse.
  • Create a shortcut in OneLake to the external source table.
  • Create the dataset on top of the lakehouse

As you can see, you can’t escape the Fabric gravitational pull to get Direct Lake. Further, the Parquet files produced by the Fabric workloads (Lakehouse/DW/etc.) will typically be faster and more compressed because of the V-order compression.

The Ugly

Among the Direct Lake limitations, the most significant for me is that not only you need Fabric to get Direct Lake, but also you must create the dataset online using the “New Power BI dataset” feature in Lakehouse/Warehouse, which has its own limitations.

Therefore, for now you can’t use Power BI Desktop to create your semantic model that uses Direct Lake connectivity. This will require Write support to be added to the Analysis Services Power BI XMLA endpoint. However, once you create the Direct Lake dataset, you can use Power BI Desktop to connect to it using the OneLake Data Hub connector.

Power BI Projects and Git Integration: The Good, the Bad, and the Ugly

Is it getting better?
Or do you feel the same?
Will it make it easier on you now?
You got someone to blame

U2

Likely influenced by the Gartner’s Fabric vision, Microsoft Fabric is eclipsing all things Power BI nowadays to the extent of replacing the strong Power BI brand and logo with Fabric’s. Now that the red star Synapse has imploded into a black hole, Fabric has taken its place and it’s engulfing everything in its path.

But to digress from Fabric, let’s take a look at two developer-oriented and frequently requested features that fortunately doesn’t require Fabric: Power BI Desktop projects and workspace integration with Git. The video in the link is a good starting point to understand how these features work.

Basically, the first feature lets you save a Power BI Desktop file as a *.pbip file which generates a set of folders with human-readable text files, such as the model.bim file (has the model definition described in JSON). Next, you can put these files under source control using any source control-enabled tool, such as Visual Studio Code or Visual Studio. Currently in preview, Power BI projects is an experimental feature (in fact, the Publish button is disabled so you can’t publish a project to Power BI Service).


The second feature (workspace integration with Git) lets you configure a workspace to keep Power BI reports and datasets published to the workspace under source control. No other items are currently supported (even paginated reports). This feature is independent of the client-side project mode and it doesn’t require saving Power BI Desktop files as projects.

The Good

Although the primary target of PBI Desktop projects is to enable source control, a more useful side effect to me is that we can now get to and edit the report schema.

The other day someone asked me how you can apply conditional settings of one field in a visual to another. Or, how can you replace the field that has conditional settings applied without loosing them. Well, with some JSON wizardry, you can open the report.json file and make the changes there. At your risk of course because the report schema is ugly and not documented.

I also commend Microsoft for supporting Power BI workspace Git integration in Premium Per User (PPU) licensing to make it more accessible (unfortunately, it’s not available to the masses operating on Power BI Pro). I like the two-way integration in a Git-enabled workspace. It works like Azure Data Factory, so I don’t have to configure source control on the client.

The Bad

The report JSON schema is not documented and as such source code compare is difficult. Parsing a name-value collection in the visual’s config section is required to get to the useful staff.

I hope at some point Microsoft will simplify and document the report schema to let the community take over where Microsoft left off and create tools to compare the report definition.

Naturally, I’d like to see Git integration supporting all workspace items, such as paginated reports, dashboards, and Fabric artifacts.

The Ugly

If you follow my blog, you know that I’m not a big fan of Power BI Desktop for semantic modeling. Instead, I rely on Tabular Editor. When in pbix mode, changes saved in Tabular Editor appear in Power BI Desktop (the reverse is not true). That’s because Tabular Editor connected to a pbix file actually connects to the running instance of the Analysis Services engine and applies changes there. It edits the same in-memory model that PBI Desktop is connected to, thus allowing PBI Desktop to automatically refresh metadata changes.

By contrast, changes made in Tabular Editor in project mode are not reflected in Power BI Desktop.

That’s because changes made in Tabular Editor in project mode are written to the model.bim file and you need to close and open PBI Desktop for these changes to appear. This can get tedious very quickly. My recommendation would be for Power BI Desktop to listen to file changes and automatically refresh the metadata.

Summary

I’ll wait for the project mode feature to mature on the client. Meanwhile, I’ll continue with *.pbix and enable source control integration at the workspace level for clients on PPU or higher licensing. With Power BI Pro licensing, I’ll continue putting the model.bim file from Tabular Editor under source control (no source control for reports).

Fabric Data Integration: The Good, the Bad, and the Ugly

Oops, I did it again
I played with your heart
Got lost in the game…

Britney Spears

In previous posts, I shared my thoughts about Fabric OneLake, Lakehouse, and Data Warehouse. They are of course useless if there is no way to get data in and out of Fabric. Data integration and data quality is usually the most difficult part of implementing a BI solution, accounting for 60-80% of the overall effort. Therefore, this post is about Fabric data integration options.

Fabric supports three options for automated data integration: Data Pipeline (Azure Data Factory pipeline), Dataflow Gen2 (Power BI dataflow), and Notebook (Spark). I summarize these three options in the following table, which loosely resembles the Microsoft comparison table but with my take on it.

Data pipeline
(ADF pipeline/copy activity)
Dataflow Gen2
(Power BI dataflow)
Notebook
(Spark)
Primary userBI developerBusiness analystData scientist, Developer
Patterns supportedETL/ELTETLETL
Primary skillsetSQLPower QuerySpark
Data volumeHighLow to mediumHigh
Primary code languageSQLMScala, Python, Spark SQL, R
ComplexityMediumLowHigh
Vendor lock-inMedium (minimize with ELT pattern)HighLow

The Good

We have three options for data integration to support different personas and skillsets. Unlike other “a notebook with a blinking cursor” vendors, data pipeline and dataflow provide no code/low code options.

Power BI dataflows are now supposedly more scalable, which apparently justifies the Gen2 tag. They finally support destinations although the list is limited to Azure SQL Database and the Fabric engines (Lakehouse, Warehouse, and KQL Database).

The Copy activity in Fabric data pipelines now supports creating delta tables although it doesn’t support merges.

The Bad

Microsoft is pushing dataflows to “data engineer, data integrator, and business analyst”. My guidance is to consider dataflows only if you want to open data ingestion to business users (something you must carefully think about and definitely surround it with a log of supervision). As its predecessor (Power BI dataflows), Power Query is notoriously difficult to troubleshoot or optimize. It doesn’t support the ELT pattern (my favorite), such as to handle Type 2 changes. This could be partially ramified by implementing a pipeline that mixes dataflows with other ADF artifacts, such as calling stored procedures in Fabric Warehouse. Moreover, I consider Power Query as a Microsoft proprietary tool, irrespective that the M language is documented. If one day you decide to leave Fabric, you’d need to rewrite your flows. Finally, the only output options supported are append or replace (no update).

Moving to ADF, the copy activity supports only append or replace (no update). Outside Fabric, Azure Data Fabric doesn’t have connectors to connect to Fabric yet.

I personally abhor the idea to put all BI artifacts in Fabric, if not for anything else, but to have a better way out if one day a client decides to part ways with Fabric.

Haven’t we learned anything from Synapse pipelines? Ask Microsoft how to migrate them to Fabric if you have fallen for that “best practice”. I’d carefully weight going all the way with Fabric (I know that bundles are a big incentive) instead of being more independent and use Fabric more selectively.

For data warehousing, which is the primary scenario I personally care about in Fabric, I primarily rely on the ELT pattern for a variety of reasons. I shall miss T-SQL MERGE in Fabric Warehouse, but I plan to leave it to marinate for a year or so anyway.

The Ugly

After all the push to use ADF mapping data flows in Synapse, where are they hiding in Fabric? Alas, they haven’t made it and they were superseded by Power BI dataflows.

This underscores another important reason to use the ETL pattern whenever you can. At least you can salvage your SQL code as a vendor “evolves” or “revolutionizes” their offerings. Which is another way of a vendor saying “oops, we did it again…” and we shall go back to the drawing board.

Atlanta Microsoft BI Group Meeting on July 11th (Azure OpenAI – Answers to Your Natural Language Questions)

Atlanta BI fans, please join us for the next meeting on Tuesday, July 11th, at 6:30 PM ET.  Stacey Jones (Cross Solution Architect at Microsoft) will join us in person to present OpenAI and Copilot. Your humble correspondent will demo the newly released PBI Desktop project format. For more details and sign up, visit our group page.

PLEASE NOTE A CHANGE TO OUR MEETING POLICY. EFFECTIVE IMMEDIATELY, WE ARE DISCONTINUING ONLINE MEETINGS VIA TEAMS. THIS GROUP MEETS ONLY IN PERSON. WE WON’T RECORD MEETINGS ANYMORE. THEREFORE, AS DURING THE PRE-PANDEMIC TIMES, PLEASE RSVP AND ATTEND IN PERSON IF YOU ARE INTERESTED IN THIS MEETING.

Presentation: Azure OpenAI – Answers to Your Natural Language Questions

Date: July 11th (Please note that because of the July 4th holiday, this meeting is on Tuesday)

Time: 18:30 – 20:30 ET

Level: Intermediate

Food: As of now, food won’t be available for this meeting. We are welcoming suggestions for a sponsor.

Agenda:

18:15-18:30 Registration and networking

18:30-19:00 Organizer and sponsor time (events, Power BI latest, sponsor marketing)

19:00-20:15 Main presentation

20:15-20:30 Q&A

ONSITE

Improving Office

11675 Rainwater Dr

Suite #100

Alpharetta, GA 30009

Overview: We will overview and explore Azure OpenAI, a fusion of OpenAI with the Azure cloud. Understand what OpenAI and CoPilot are as well as how they can help you with your natural language questions. We will also address questions about AI, such as will AI take over the world?

 Speaker: Stacey Jones is a Cross Solution Architect specializing in all things Data, AI and Power BI. He is Evangelist at Microsoft’s Technology Center (MTC) in Atlanta, GA. He has 30+ years of industry experience in technology management spanning data architecture, data modeling, database development, tuning, and administration in both Oracle and SQL Server. He is active on LinkedIn and has plenty of real-life experience putting out database performance fires.

PowerBILogo

Fabric Data Warehouse: The Good, The Bad, and the Ugly

“Patience my tinsel angel, patience my perfumed child
One day they’ll really love you, you’ll charm them with that smile
But for now it’s just another Chelsea [TL: BI] Monday”
Marrillion

Continuing our Power BI Fabric journey, let’s look at another of its engines that I personally care about – Fabric Warehouse (aka as Synapse Data Warehouse). Most of my real-life projects require integrating data from multiple data sources into a centralized repository (commonly referred to as a data warehouse) that centralizes trusted data and serves it as a source to Power BI and Analysis Services semantic models. Due to the venerable history of relational databases and other benefits, I’ve been relying on relational databases powered by SQL Server to host the data warehouse. This usually entails a compromise between scalability and budget. Therefore, Azure-based projects with low data volumes (up to a few million rows) typically host the warehouse in a cost-effective Azure SQL Database, while large scale projects aim for Synapse SQL Dedicated Pools. And now there is a new option on the horizon – Fabric Warehouse. But where does it fit in?

The Good

Why would you care about Fabric Warehouse given that Microsoft has three (and counting) Azure SQL SKUs: Azure SQL Database, Azure SQL Managed Instance, and Synapse SQL Dedicated Pools (based on Azure SQL Database)? Ok, it saves data in an open file format in OneLake (delta tables), but I personally don’t quite care much about how data is saved. We could say that SQL Server is also (although proprietary) a file-based engine that served us well for 30+ years. I care much more about other things, such as performance, budget, and T-SQL parity. However, as I mentioned in the previous Fabric-related blogs, the common delta format across Fabric enables data virtualization, should you decide to put all your eggs in one basket and embrace Fabric for all your data needs. Let’s say you have some files in the lakehouse managed zone. You can shortcut lakehouse delta tables to the Data Warehouse engine and avoid data movement. By contrast, previously you had to spin more ETL to move data from the Synapse serverless endpoint to the Synapse dedicated pools.

A big improvement is that you can also cross-query data warehouses and lakehouses, which wasn’t previously possible in Azure Synapse.

You can do this by writing familiar T-SQL queries with three-part-naming, such as querying a lakehouse table in your warehouse:
SELECT TOP (100) * FROM [wwilakehouse].dbo.dimension_customer

In fact, the elimination (or rather fusion) of Synapse Dedicated Pools and Serverless is a big plus for me. The infinite scale and instantaneous scalability that resulted from decoupling compute and store sounds interesting. I don’t know how Fabric Warehouse will evolve over time but I really hope that at some point, it will eliminate the afore-mentioned Azure SQL SKU decision for hosting relational data warehouses. I hope it will support the best of both worlds – the ability to dynamically scale (as serverless Azure SQL Database) up to the large scale of Synapse Dedicated Pools.

The Bad

Now that Azure SQL SKUs are piling up, Microsoft owes us serious benchmarks for Fabric Warehouse analytical and transactional loads.

Speaking of loads and considering that Fabric Warehouse saves data in columnar structures (Parquet files), this is not a tool to use for heavy OLTP loads. Columnar formats fit data analytics like a glove when data is analyzed by columns, but they will probably perform poorly for heavy transactional loads. Again, benchmarks could help determine where that cutoff point is.

Unlike Power BI Premium (P) capacities, Fabric capacities acquired by purchasing an Azure F plan can be scaled up or down (like Power BI Embedded A* plans). However, like A* plans, this requires manual intervention. I was hoping for dynamically scaling up and down workloads within a capacity to meet more demand and reduce cost. For example, I should be able to configure the capacity to auto-pause Fabric Warehouse when it is not in use and the data is imported in a Power BI dataset.

As I mentioned in my Fabric Lakehouse: The Good, The Bad, and the Ugly post, given that both saves data in the same format, I find the Lakehouse and Warehouse engines redundant and confusing,. Currently, ADLS shortcomings are implicated for this separation and my hope is that at one point they will merge. If you are a SQL developer, you should be able to use SQL, and if you prefer notebooks for whatever reason, you can use Python or whatever language you prefer on the same data. This will be really nice and naturally resolve the lakehouse-vs-warehouse decision point.

The Ugly

Due to the fact that it’s completely rewritten, a work in progress, and therefore subject to many limitations, as it stands Fabric Warehouse is unusable for me.

After the botched Synapse saga, Microsoft had the audacity to call this offering the next generation of Synapse warehouse (in fact, the Fabric architecture labels is as Synapse Data Warehousing, I guess to save face) despite the fact that they pretty much start from scratch at least in terms of features. I call it “Oops, we did it again, we overpromised and overmarketed it, got lost in the game, Ooh, baby, baby…” I’ll give Fabric Warehouse a year or so before I revisit. If my hope for Fabric Warehouse as a more straightforward choice for warehousing materializes, I also wish Microsoft decouples it from Fabric and makes it available as a standalone Azure SKU offering. Which is pretty much my wish for all tools in the Fabric bundle. Bundles are great until you decide to opt out and salvage pieces…

Finally, after all the push and marketing hoopla, Synapse SQL Dedicated Pools seems to be on an unofficial deprecated path. I guess Microsoft has seen the writing on the wall from competing with other large-scale DW vendors. The current guidance is to consider them for high-scale performance while the “evolved” and “rewritten” Fabric Synapse Warehouse should be at the forefront for future DW implementations. And there is no migration path from Dedicated SQL Pools.