Fabric Lakehouse: The Good, The Bad, and the Ugly

“Hey, I got this lake house, you should see it
It’s only down the road a couple miles
I bet you’d feel like you’re in Texas
I bet you’ll wanna stay a while”
Brad Cox

Continuing our Fabric purview, Lakehouse is now on the menu. Let’s start with a definition. According to Databricks, which is credited with coining the term, a data lakehouse is “a new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling business intelligence (BI) and machine learning (ML) on all data.” Further, their whitepaper “argues that the data warehouse architecture as we know it today will wither in the coming years and be replaced by a new architectural pattern, the Lakehouse, which will (i) be based on open direct-access data formats…” In other words, give us all of your data in a delta lake format, embrace our engine, and you can ditch the “legacy” relational databases for data warehousing and OLAP.

Microsoft’s lakehouse definition is less ambitious and less exclusive. “Microsoft Fabric Lakehouse is a data architecture platform for storing, managing, and analyzing structured and unstructured data in a single location. It is a flexible and scalable solution that allows organizations to handle large volumes of data using a variety of tools and frameworks to process and analyze that data. It integrates with other data management and analytics tools to provide a comprehensive solution for data engineering and analytics.” In other words, a lakehouse is whatever you want it to be if you want something better than a data lake.

The Good

I like the Microsoft lakehouse definition better. Nobody forces me to adopt an architectural pattern that probably won’t stand the test of time anyway.

If I must deal with files, I can put them in the lakehouse unmanaged area. The unmanaged area is represented by the Files virtual view. This area somehow escapes the long-standing Power BI limitation that workspaces can’t have subfolders and therefore allows you to organize the files in any folder structure. And if I want to adopt the delta lake format, I can put my data in the lakehouse managed area, which is represented by the Tables virtual view.

Further, the Fabric lakehouse automatically discovers and registers delta lake tables created in the managed area. So, you can use a Power Query dataflow or an ADF Copy Activity to write some data in the delta lake format, and Fabric will register the table for you in the Spark metastore with the necessary metadata, such as column names, formats, compression, and more (you don’t have to use Spark to register the table). Currently, automatic discovery and registration is supported only for data written in the delta lake format.
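For illustration, here is a minimal sketch of the same idea from a Fabric notebook attached to the lakehouse: writing a dataframe in delta format under the Tables path (the source file and the sales_orders table name are made up) and letting auto-discovery register it, just as it would for a dataflow or Copy Activity output.

  # Minimal sketch, assuming a Fabric notebook with the default Spark session
  # and an attached lakehouse; the source path and table name are illustrative.
  df = spark.read.format("csv").option("header", "true").load("Files/raw/sales_orders.csv")
  (df.write
     .format("delta")
     .mode("overwrite")
     .save("Tables/sales_orders"))  # auto-discovery should register this as the sales_orders table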

You get two analytical engines for the price of one to query the delta tables in the managed area. Like Synapse Serverless, the SQL endpoint lets users and tools query the delta tables using SQL. And the brand-new Direct Lake mode in Analysis Services (more on this in a future post) lets you create Power BI datasets and reports that connect directly to the delta tables with blazing speed, so you don’t have to import the data to get better performance.
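As a rough illustration of the first option, any SQL-capable client should be able to query the managed tables through the SQL endpoint. The sketch below uses Python with pyodbc and interactive Azure AD authentication; the server name is a placeholder for your lakehouse SQL endpoint, and the database and table names follow the tutorial example.

  # Rough sketch: query a lakehouse delta table through the SQL endpoint.
  # The server name is a placeholder; other authentication options
  # (service principal, device code, etc.) may fit better in practice.
  import pyodbc

  conn = pyodbc.connect(
      "Driver={ODBC Driver 18 for SQL Server};"
      "Server=<your-lakehouse-sql-endpoint>;"
      "Database=wwilakehouse;"
      "Authentication=ActiveDirectoryInteractive;"
  )
  for row in conn.execute("SELECT TOP 10 * FROM dbo.dimension_customer"):
      print(row)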

Finally, I love the ability to shortcut lakehouse tables to the Data Warehouse engine and avoid data movement. By contrast, previously you had to spin up more ETL to move data from the Synapse serverless endpoint to the Synapse dedicated pools.

The Bad

As the most lake-like engine, the Fabric lakehouse shares the same limitations as Fabric OneLake, which I discussed in a previous post.

Going quickly through the list, the managed area (Tables) can’t be organized in subfolders. If you try to paint outside the “canvas”, the delta tables will end up in an Unidentified folder and they won’t be automatically registered. Hopefully, this is a preview limitation, since otherwise I don’t see how you can implement a medallion file organization if that’s your thing. Further, while you can create delta tables from files in the unmanaged area, the tables are not automatically synchronized with changes to the original files. Automatic delta table synchronization could be very useful, and I hope it will make its way to the roadmap.
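Until such synchronization arrives, the only workaround I can think of is to rebuild the table whenever the source files change, for example with a scheduled notebook along these lines (the folder and table names are made up):

  # Rough sketch: re-materialize a delta table from files in the unmanaged area.
  # Because changes to the source files are not picked up automatically, this
  # would have to be re-run on a schedule or after each file drop.
  src = spark.read.format("csv").option("header", "true").load("Files/landing/customers/")
  src.write.format("delta").mode("overwrite").save("Tables/customers")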

The Microsoft API wrapper that sits on top of OneLake is married to the Power BI security model. You can’t overwrite or augment the security permissions, such as by granting ACL permissions directly to the lakehouse folders. Even Shared Access Signature is not allowed, and a storage account key is not an option either, as you don’t have access to the storage account. So, to get around such limitations, you’d have to create storage accounts outside Fabric, which defeats the OneLake vision.

It’s also unclear how Fabric will address DevOps needs, such as separate Development and Production environments. Currently, a best practice is to keep these environments separate, so more than likely Microsoft will enhance Power BI deployment pipelines to handle all Fabric content.

The Ugly

Given that both the lakehouse and data warehouse (new Synapse) have embraced delta lake storage, it’s redundant and confusing to have two engines.

The documentation goes to great lengths to explain the differences, but why the separation? Based on what I’ve learned, the main reason is related to ADLS limitations in supporting important SQL features, including transactions that span multiple tables, lack of full T-SQL support (no updates, limited reads), and performance issues with trickle transactions. Hopefully, one day these issues will be resolved and the lakehouse and data warehouse will merge to give us the most flexibility: if you know SQL, you’d use SQL, and if you prefer notebooks, you’d use the Spark-supported languages to manipulate the same data.

PROLOGIKA NEWSLETTER SUMMER 2023

Enterprise Resource Planning (ERP) systems capture a wealth of information from the main business processes within an organization, such as Sales, Inventory, and Orders. Therefore, a BI journey typically starts with analyzing the ERP data, with revenue analytics usually taking center stage because of its vital metrics, such as Revenue, Margin, and Profit. In fact, this scenario has repeated itself so often over the past few years that Prologika has developed a fast-track implementation for ERP analytics. The client presented in this case study had more complicated integration needs, but other projects for both larger and smaller clients in the same or other industries have benefited from the same methodology.

Business Needs

The client manufactures and distributes electrical components. The company has grown rapidly through acquisitions, with every business unit having its own ERP system, including Dynamics and other business-specific ERPs. Some ERPs were installed on premises, such as Dynamics, which stores its data in a SQL Server database in the company’s data center, while others were cloud systems.

The client realized that it’s difficult to analyze performance across business units and disparate ERP systems. They envisioned a centralized and fully automated reporting solution that would consolidate data from all ERPs into a single repository so they could track sales across business units, salespeople, vendors, and customers. In addition, they wanted their developers to enhance and maintain the solution, so simplicity and knowledge transfer were paramount.

Solution

After assessing the current state and objectives, Prologika recommended and implemented the following cloud architecture:

Prologika adopted an iterative approach to integrate the data from all five ERP systems, with each ERP having its own iteration. Laying the foundation, the first iteration was the most important and had the following main deliverables:

  1. Schema-first design – Prologika analyzed the requirements for Sales analytics and designed a star schema. The schema design was crucial because all other layers, such as ETL and the semantic model, depended on it.
  2. Endpoints – Prologika worked with the ERP subject matter expert to implement endpoints that would provide a data feed for each table in the schema.
  3. ETL – Prologika used Azure Data Factory (ADF) no-code pipelines. The integration effort was dramatically reduced because all ERP endpoints adhered to the same specification and hid the underlying details of each specific ERP, such as which tables needed to be joined. For best performance and fast knowledge transfer, Prologika followed the ELT pattern to transform and load the staged data using T-SQL stored procedures.
  4. Data warehouse – Prologika staged, transformed, and loaded the data into a centralized data warehouse repository whose structure was based on the star schema design in step 1.
  5. Semantic model – Prologika implemented a Power BI semantic model that delivered instantaneous insights and a single version of truth irrespective of the reporting tool used (Power BI, Excel, or third-party tools). The semantic model included the desired business metrics, such as time intelligence and variances, as well as data security.
  6. Reports – Prologika developed Power BI report “packs”, such as Sales Analytics and Order Analytics, that connected to the semantic model. Users could also use Excel and other tools to create pivot reports without extracting and duplicating data.

Benefits

The solution delivered the following benefits to the client:

  • Fast data integration – Each subsequent ETL iteration could benefit from the common architecture.
  • Data consolidation – Data was consolidated into a single repository.
  • Low licensing cost – The solution had a licensing cost in the range of $400/month, with most of that spent on Power BI licensing.
  • Easy maintenance – The client was able to quickly learn, maintain, and enhance the solution.


Teo Lachev
Prologika, LLC | Making Sense of Data

Fabric OneLake: The Good, the Bad, and the Ugly

To know a tool is to know its limitations – Teo’s proverb

In a previous post, I shared my overall impression of Fabric. In this post, I’ll continue exploring Fabric, this time sharing my thoughts on OneLake. If you need a quick intro to Fabric OneLake, Josh Caplan’s “Build 2023: Eliminate data silos with OneLake, the OneDrive for Data” presentation provides a great overview of OneLake, its capabilities, and the vision behind it from a Microsoft perspective. If you prefer a shorter narrative, you can find it in the “Microsoft OneLake in Fabric, the OneDrive for data” post. As always, we are all learning, and constructive criticism would be appreciated if I missed or misinterpreted something.

What’s Fabric OneLake?

In a nutshell, OneLake is a Microsoft-provisioned storage layer where the ingested data and the data from the analytical (compute) engines are stored (see the screenshot below). Currently, Power BI datasets (Analysis Services) are not saved in OneLake, although one could anticipate that in the long run all Power BI data artifacts could (and should) end up in OneLake for Fabric-enabled Power BI workspaces.

Behind the scenes, OneLake uses Azure Data Lake Storage (ADLS) Gen2, but there is an application wrapper on top of it that handles various tasks, such as management (you don’t have to provision storage accounts or external tables), security (Power BI security is enforced), and governance. OneLake is tightly coupled with the Power BI catalog and security. In fact, you can’t create folders outside the Power BI catalog, so proper catalog planning is essential. When you create a Fabric workspace in Power BI, Microsoft provisions an empty blob container in OneLake for that workspace. As you provision additional Fabric services, Fabric creates more folders and saves data in these folders. For example, I can use Azure Storage Explorer to connect to a Fabric Lakehouse Tutorial workspace (https://onelake.blob.fabric.microsoft.com/fabric lakehouse tutorial) and see that I have provisioned a data warehouse called dw and a lakehouse called wwilakehouse. Microsoft has also added two additional system folders for data staging.
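For programmatic access, the same workspace can be reached with the standard ADLS Gen2 SDK pointed at the OneLake endpoint. The sketch below lists the items in a workspace, assuming the azure-identity and azure-storage-file-datalake Python packages; the workspace name follows the tutorial example, and the signed-in user’s Power BI permissions still apply.

  # Minimal sketch: browse a Fabric workspace in OneLake with the ADLS Gen2 SDK.
  # Assumes the azure-identity and azure-storage-file-datalake packages are installed;
  # the workspace name is an example and Power BI permissions still apply.
  from azure.identity import DefaultAzureCredential
  from azure.storage.filedatalake import DataLakeServiceClient

  service = DataLakeServiceClient(
      account_url="https://onelake.dfs.fabric.microsoft.com",
      credential=DefaultAzureCredential(),
  )
  workspace = service.get_file_system_client("Fabric Lakehouse Tutorial")
  for item in workspace.get_paths(recursive=False):
      print(item.name)  # e.g., dw, wwilakehouse, and the staging folders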

The Good

I like to have the data from the compute engines saved in one place and to be able to access that data as delta Parquet files. There is something serene about this transparency after decades of not being able to peek under the hood of some proprietary file format.

I also very much like the ability to use different engines to access that data (Microsoft calls this OneCopy, although a more appropriate term in my opinion would be OneShare).

For example, I can query a table in the lakehouse directly from the data warehouse (such as select * from [wwilakehouse].dbo.dimension_customer) without having to create PolyBase external tables or use the serverless endpoint.

I welcome the possibility for better data virtualization both within and outside the organization. For example, in a lakehouse I can create shortcuts to other Fabric lakehouses, ADLS folders, and even external storage systems, such as S3 (Google and Dataverse coming up).

I like the governance aspect and the possibilities it enables, such as lineage tracking and data security. In fact, OneSecurity is on the roadmap, where once you secure the data in OneLake, the data security bubbles up. It will be interesting to see how this works and what its limitations are, as I’d imagine it won’t be as flexible as Power BI RLS.

The Bad

An API wrapper is a double-edged sword because it wraps and abstracts things. You really have to embrace the Power BI catalog and its security model, along with their limitations, because there is no way around them.

For example, you can’t use Azure Storage Explorer to change the ACL permissions on a folder or create folders where OneLake doesn’t like them (OneLake simply ignores certain APIs). Isn’t it a good thing, though, to centralize control at a higher level, such as the Power BI workspace, and prevent users from painting outside the canvas? Well, how about granting an external vendor permissions to write to OneLake, such as to upload files? From what I can see, even Shared Access Signature is not allowed, and a storage account key is not an option either, as you don’t have access to the storage account. So, to get around such limitations, you’d have to create separate storage accounts, but this defeats the promise of one lake.

Perhaps this is an example you don’t care much about. How about the medallion architecture pattern, which is getting popular nowadays for lakehouse-centric implementations? While you can create subfolders in the lakehouse “unmanaged” zone (the Files folder), no such luck with the Tables folder, so you can’t organize your delta lake tables in Bronze, Silver, and Gold zones. I can’t see another solution but to create a separate Power BI workspace for each zone.

Further, what if you must address the data residency requirements of some litigation-prone country? The above-cited presentation correctly states that content can be geo-located, but that will require purchasing more Power BI capacities, irrespective of the fact that ADLS storage accounts can be created in different geo locations. Somehow, I start longing for bringing your own data lake, and I don’t think that provisioning storage accounts is that difficult (in fact, a Power BI workspace already has a feature to link to BYO ADLS storage to save the output of Power Query dataflows). Speaking of accounts, I’m looking forward to seeing how Fabric will address DevOps, such as Development and Production environments. Currently, a best practice is to keep these environments separate, so more than likely Microsoft will enhance Power BI deployment pipelines to handle all Fabric content.

The Ugly

How can I miss another opportunity to harp again on Power BI for the lack of hierarchical workspaces? The domain feature that Microsoft shows in the presentation to logically group workspaces could be avoided to a great extent if a department, such as Finance, could organize content in subfolders and break security inheritance when needed, instead of ending up with workspaces such as Finance North America, Finance EMEA, etc. Since OneLake is married to the Power BI catalog, more catalog flexibility is essential.

Given that OneLake and ADLS in general would be a preferred location for business users to store reference data, such as Excel files, Microsoft should enhance Office to support ADLS. Currently, Office apps can’t directly open and save files in ADLS (the user must download the file somehow, edit it, and then upload it back to ADLS). Consequently, business users favor other tools, such as SharePoint Online, which leads to decentralizing the data required for analytics.