Fabric Data Integration: The Good, the Bad, and the Ugly

Oops, I did it again
I played with your heart
Got lost in the game…

Britney Spears

In previous posts, I shared my thoughts about Fabric OneLake, Lakehouse, and Data Warehouse. They are of course useless if there is no way to get data in and out of Fabric. Data integration and data quality is usually the most difficult part of implementing a BI solution, accounting for 60-80% of the overall effort. Therefore, this post is about Fabric data integration options.

Fabric supports three options for automated data integration: Data Pipeline (Azure Data Factory pipeline), Dataflow Gen2 (Power BI dataflow), and Notebook (Spark). I summarize these three options in the following table, which loosely resembles the Microsoft comparison table but with my take on it.

Data pipeline
(ADF pipeline/copy activity)
Dataflow Gen2
(Power BI dataflow)
Notebook
(Spark)
Primary userBI developerBusiness analystData scientist, Developer
Patterns supportedETL/ELTETLETL
Primary skillsetSQLPower QuerySpark
Data volumeHighLow to mediumHigh
Primary code languageSQLMScala, Python, Spark SQL, R
ComplexityMediumLowHigh
Vendor lock-inMedium (minimize with ELT pattern)HighLow

The Good

We have three options for data integration to support different personas and skillsets. Unlike other “a notebook with a blinking cursor” vendors, data pipeline and dataflow provide no code/low code options.

Power BI dataflows are now supposedly more scalable, which apparently justifies the Gen2 tag. They finally support destinations although the list is limited to Azure SQL Database and the Fabric engines (Lakehouse, Warehouse, and KQL Database).

The Copy activity in Fabric data pipelines now supports creating delta tables although it doesn’t support merges.

The Bad

Microsoft is pushing dataflows to “data engineer, data integrator, and business analyst”. My guidance is to consider dataflows only if you want to open data ingestion to business users (something you must carefully think about and definitely surround it with a log of supervision). As its predecessor (Power BI dataflows), Power Query is notoriously difficult to troubleshoot or optimize. It doesn’t support the ELT pattern (my favorite), such as to handle Type 2 changes. This could be partially ramified by implementing a pipeline that mixes dataflows with other ADF artifacts, such as calling stored procedures in Fabric Warehouse. Moreover, I consider Power Query as a Microsoft proprietary tool, irrespective that the M language is documented. If one day you decide to leave Fabric, you’d need to rewrite your flows. Finally, the only output options supported are append or replace (no update).

Moving to ADF, the copy activity supports only append or replace (no update). Outside Fabric, Azure Data Fabric doesn’t have connectors to connect to Fabric yet.

I personally abhor the idea to put all BI artifacts in Fabric, if not for anything else, but to have a better way out if one day a client decides to part ways with Fabric.

Haven’t we learned anything from Synapse pipelines? Ask Microsoft how to migrate them to Fabric if you have fallen for that “best practice”. I’d carefully weight going all the way with Fabric (I know that bundles are a big incentive) instead of being more independent and use Fabric more selectively.

For data warehousing, which is the primary scenario I personally care about in Fabric, I primarily rely on the ELT pattern for a variety of reasons. I shall miss T-SQL MERGE in Fabric Warehouse, but I plan to leave it to marinate for a year or so anyway.

The Ugly

After all the push to use ADF mapping data flows in Synapse, where are they hiding in Fabric? Alas, they haven’t made it and they were superseded by Power BI dataflows.

This underscores another important reason to use the ETL pattern whenever you can. At least you can salvage your SQL code as a vendor “evolves” or “revolutionizes” their offerings. Which is another way of a vendor saying “oops, we did it again…” and we shall go back to the drawing board.