What Can Fabric Do For My Lake?

Previously, I discussed the pros and cons of Microsoft Fabric OneLake and Lakehouse. But what if you have a data lake already? Will Fabric add any value, especially if your organization is on Power BI Premium and you get Fabric features for free (that is, assuming you are not overloading your capacity resources)? Well, it depends.

Managed Area

A Fabric lakehouse defines two areas: managed and unmanaged. The managed area (the Tables folder) is reserved exclusively for Delta/Parquet tables. If you already have a data lake with Delta/Parquet files, such as a Databricks Delta Lake, you can create shortcuts to these files or folders in ADLS Gen2 or Amazon S3, and the Fabric lakehouse will automatically register the shortcuts as tables.

Life is good in the managed area. Shortcuts to Delta/Parquet tables open interesting possibilities for data virtualization, such as:

  1. Your users can use the lakehouse SQL analytics endpoint to join tables using SQL, which is useful for ad-hoc analysis. Joins also let users shape the data they need before importing it into Power BI Desktop, as opposed to connecting to individual files and joining them in Power Query (see the sketch after this list). Not only could this reduce the size of the ingested data, but it could also improve refresh performance.
  2. Users can decide not to import the data at all but build semantic models in Direct Lake mode. This could be very useful to reduce latency or avoid caching large volumes of data.
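
To make the first scenario concrete, here is a minimal, hedged sketch of querying the SQL analytics endpoint from Python with pyodbc. The server name, database, tables, and columns are hypothetical placeholders; in practice, you would copy the connection string from the lakehouse settings.

```python
# A hedged sketch: joining two shortcut-backed Delta tables through the
# lakehouse SQL analytics endpoint. All names below are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace.datawarehouse.fabric.microsoft.com;"  # copy from lakehouse settings
    "Database=MyLakehouse;"
    "Authentication=ActiveDirectoryInteractive;"  # prompts for Microsoft Entra sign-in
)

# Shape the data once in SQL instead of joining individual files in Power Query
sql = """
SELECT c.Region, SUM(s.SalesAmount) AS SalesAmount
FROM dbo.sales AS s
JOIN dbo.customers AS c ON s.CustomerKey = c.CustomerKey
GROUP BY c.Region
"""
for row in conn.cursor().execute(sql):
    print(row.Region, row.SalesAmount)
```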

Unmanaged Area

Very few organizations have lakes full of Delta/Parquet files. Most data lakes contain heterogeneous files, such as text, Excel, or regular Parquet files. While a Fabric lakehouse can create shortcuts to any file, shortcuts to non-Delta files go to the unmanaged area (the Files folder).

Life is miserable in the unmanaged area. None of the cool stuff you see in demos happens here because the SQL analytics endpoint and Direct Lake mode are not available. A weak case can still be made for data virtualization: shortcuts at least bring the data to where business users collaborate in Power BI, namely the Power BI workspace.

But what can users do with these unmanaged shortcuts? Not much, really. Power BI Desktop doesn’t even expose them when you connect to the lakehouse. Power BI dataflows Gen2 do give users access to the Files folder, so they could potentially create dataflows that transform data from these files.

Of course, the tradeoff here is that you are adding dependencies on OneLake, which could be a problem should you one day decide to part ways. Another issue is that you are layering Power BI security on top of your data lake security.

Oh yes, users can also load Parquet and CSV files into Delta tables by right-clicking a folder or a file in the unmanaged area and selecting Load to Tables (New or Existing). Unfortunately, as it stands, this is a manual process that must be repeated whenever the source data changes.
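
One workaround is to script the load in a Fabric notebook and schedule it. Below is a minimal sketch, assuming the notebook has a default lakehouse attached; the file path and table name are hypothetical.

```python
# A minimal sketch of scripting "Load to Tables" in a Fabric notebook.
# The file path and table name are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a CSV from the unmanaged Files area of the default lakehouse
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("Files/reference/customers.csv"))

# Materialize it as a Delta table in the managed Tables area
df.write.format("delta").mode("overwrite").saveAsTable("customers")
```

Scheduling this notebook (or a pipeline that runs it) removes the need to repeat the manual gesture every time the file changes.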

Imagined Unmanaged Data Virtualization

This brings me to the two things that I believe Microsoft can do to greatly increase the value proposition of “unmanaged” data virtualization:

  1. Extend Load to Tables to the most popular file formats, such as JSON, XML, and Excel, or at least the ones PolyBase has supported for years. I’m not sure why we must obsess over Delta/Parquet and nothing else if Microsoft is serious about data virtualization.
  2. Implement automatic synchronization that updates the corresponding Delta table when the source file changes (see the notebook sketch below for a stopgap).
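
Here is a hedged sketch of that stopgap: a notebook cell that loads an Excel file, a format Load to Tables doesn’t handle, into a managed Delta table; scheduling the notebook approximates the missing synchronization. The path and table name are hypothetical, and reading .xlsx with pandas assumes the openpyxl package is available.

```python
# Hypothetical workaround: materialize an Excel file from the unmanaged
# Files area as a managed Delta table. Run it on a schedule to mimic sync.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# /lakehouse/default/ is the local mount of the notebook's default lakehouse
budget = pd.read_excel("/lakehouse/default/Files/finance/budget.xlsx")

(spark.createDataFrame(budget)
      .write.format("delta")
      .mode("overwrite")
      .saveAsTable("budget"))
```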

If these features are added, throwing Fabric into the mix could become more appealing.

In summary, Microsoft Fabric has embraced Delta/Parquet as its native storage format and has added various features that target it. Unfortunately, none of these features extend to other file formats. You must evaluate the pros and cons when adopting Fabric alongside an existing data lake. As it stands, Fabric probably won’t add much business value for data virtualization over file formats other than Delta/Parquet. As Fabric matures, new scenarios might make Fabric integration and dependency easier to justify.


Make a Pledge and Trouble is at Hand: Microsoft AI and Fabric Copilots

Rounding out the wisdom from the inscriptions of the Apollo Temple (see the Nothing in Excess and Know Thyself posts), the last of the famous inscriptions was “Make a pledge and trouble is at hand.” Although its original intent has been lost to time, this inscription is believed to instruct us to understand our limits so we don’t overpromise.

Fast-forward to modern data analytics: I remember how Microsoft Power BI started less than 10 years ago with the grand promise of democratizing BI for the masses. One would think that AI would be an essential ingredient in fulfilling that promise. To its credit, Power BI includes various useful built-in ML features, such as Analyze Increase/Decrease (my favorite), Decomposition Tree, Key Influencers, Get Insights, and Q&A.

It’s surprising to me, though, that Fabric copilots are an ultra-premium feature: they require at least a P1 capacity or its Fabric equivalent, F64, with a price tag of $8,600 per month. As such, they will be an option only for large organizations that can afford this budget. Apparently, Microsoft sees copilots as the cherry on top of the Fabric pie that will commit undecided customers to premium spending. What happened to democratization and increased productivity? Are they now domains of the rich?

I can’t try the Fabric copilots because they aren’t available even with a Fabric trial. But judging from the demos Microsoft has provided, I’m not that impressed with them anyway. They all seem to be poor code generators on top of the Azure OpenAI Service.

I remember how Q&A and natural language interfaces were supposed to revolutionize data analytics. Now I don’t know of anyone using Q&A, although it can certainly produce exciting demos. As another footnote to Gartner-inspired grand visions, Fabric copilots will probably meet the same fate, only faster because of their hefty price tag.

Nothing in Excess: Deciphering Data Architectures by James Serra

In my last post, I talked about the lessons learned from the ancient Greeks. The second inscription on the temple of Apollo in Delphi was “Nothing in Excess.” Yet, inspired to be modern and embracing vendor propaganda, companies in my experience tend to overdo their data architectures. I talked about this in this newsletter. Another excellent resource that could help you strike a balance is James Serra’s latest book, “Deciphering Data Architectures.” I’ve been privileged to know James for a while, to have him present multiple times to our Atlanta BI Group, and to be an early reviewer of this book.

Besides consolidating the wealth of information from his excellent blog, James takes a vendor-neutral approach to the various patterns. For each pattern, he describes the architecture and then analyzes its pros and cons.

After all, the right architecture and “best practice” is what works for you given your specific circumstances and restrictions. Nothing more, nothing less. And you’ll still be “modern”, whatever that means.

How do I personally avoid excess in my data architectures? By keeping them simple. Although there are exceptions when it makes sense, the general tenets for most implementations are:

  1. Stage only the data I need instead of staging all source tables. Yes, this requires joining tables at the source, but that’s what relational databases (usually the source) are designed to do.
  2. If the data is hosted in relational databases, stage it directly to the cloud data warehouse and bypass intermediate staging to a lake. In general, I hate working with files.
  3. If the data is provided as files, such as reference data or source data downloaded or uploaded as files, stage it to a data lake.
  4. Avoid REST APIs of cloud vendors like the plague unless there is no other way.
  5. Do incremental data extraction when necessary (e.g., above a few million rows) and possible (see the sketch after this list).
  6. Embrace the ELT pattern.
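
To illustrate tenets 5 and 6, here is a minimal, hedged sketch of watermark-based incremental extraction that lands raw rows in a staging table for later in-warehouse transformation. The ODBC data sources, tables, and columns are hypothetical placeholders.

```python
# A minimal sketch of watermark-based incremental extraction (ELT style).
# The DSNs, tables, and columns below are hypothetical.
import pyodbc

src = pyodbc.connect("DSN=SourceDB")
tgt = pyodbc.connect("DSN=WarehouseDB")

# High-water mark: the latest row already landed in the warehouse stage table
watermark = tgt.cursor().execute(
    "SELECT COALESCE(MAX(ModifiedDate), '19000101') FROM stage.Sales"
).fetchone()[0]

# Pull only the rows changed since the last run...
rows = src.cursor().execute(
    "SELECT SalesKey, Amount, ModifiedDate FROM dbo.Sales "
    "WHERE ModifiedDate > ?", watermark
).fetchall()

# ...and land them untransformed; transformations happen later in the warehouse
tgt.cursor().executemany(
    "INSERT INTO stage.Sales (SalesKey, Amount, ModifiedDate) VALUES (?, ?, ?)",
    [tuple(r) for r in rows]
)
tgt.commit()
```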

Are You Modern Yet?

If you’re a BI architect responsible for designing Azure-based cloud BI solutions and you’re not following James Serra’s blog, you’re missing a lot. Previously an independent consultant and MVP, James now works as a Big Data/Data Warehouse Evangelist at Microsoft. I can’t agree more with his latest blog post, “Should I load structured data into my data lake?”

We’ve heard a lot lately from Microsoft and other vendors about what the “modern” data warehouse architecture should look like. Here is the diagram taken from, you guessed it, the Microsoft Modern Data Warehouse page.

[Figure: the “modern” data warehouse architecture diagram from the Microsoft Modern Data Warehouse page]

The problem is that many organizations take this diagram verbatim, whether it makes sense or not. For example, a client exports data from on-premises relational databases to text files in Azure Data Lake, only to load it later into a cloud data warehouse. When I observed that this is redundant and pointless, they answered that I shouldn’t deemphasize the “whole Big Data” architecture in case their users one day decide to do machine learning. When I countered that, should that moment arrive, data scientists would benefit more from data readily accessible in a relational database, they said that this contradicts what Microsoft recommends.

Of course, this “modern” architecture makes a lot of sense to the vendor (it’s a nice and perpetual revenue stream) and to the consulting partner (the more stuff, the higher the bill), but it’s more than likely overkill for you. Here are the most common mistakes I see organizations making on their BI journey to the cloud:

  1. Excessive data movement – The more you move the data, the more problems: data quality issues, latency, maintenance, endless troubleshooting, and so on. You know it and I know it. In another case, I saw a Cosmos DB sneaked in between, just in case there was too much structure in the source data.
  2. Architecture “gold plating” – A classic example, which the diagram perpetuates, is Azure SQL Data Warehouse. If you had a few million rows of data in your data center, would you buy Teradata? Of course you wouldn’t! So, just because you can afford the monthly premium for Azure SQL Data Warehouse, why would you go for an MPP platform? Not to mention that you’d more than likely have to redo your entire ETL if you’re migrating from on premises, as there is still no T-SQL parity. I wouldn’t even consider Azure SQL Data Warehouse for workloads under 1 TB, and perhaps not even then, because a semantic model will probably do the heavy lifting. As for ETL, where Microsoft recommends ADF, read my thoughts on it here.
  3. Migrating to the cloud to save cost – No vendor would let its cloud (or whatever) alternative cannibalize another offering. Your primary goal for migrating to the cloud should be not cost savings but reliability and convenience. Try to get your IT department to promise an SLA (or anything, for that matter), or to provision a VM in minutes. On cost alone, you’ll more than likely spend more moving to the cloud, especially if you go PaaS. For example, a client found that their bill went up astronomically because each BI PaaS offering (Database Engine, Analysis Services, ETL, Power BI) has its own pricing, whereas a single SQL Server EE license covered all of these in their data center. The “modern” architecture above could easily run $50,000/month, which is what a perpetual SQL Server EE license would cost you, but spread over years.

A famous writer was once asked how he knows when a manuscript is done, and he said: when there is nothing more to take out. This principle should apply to BI architectures too. What can you take out and still meet the business requirements without overcomplicating things? It does cost a lot to be “modern”.