Atlanta Microsoft BI Group Meeting on September 11th (Introducing Lakehouse in Microsoft Fabric)

Atlanta BI fans, please join us for the next meeting on Monday, September 11th, at 6:30 PM ET.  Shabnam Waston (BI Consultant and Microsoft MVP) will introduce us to the Lakehouse engine in Microsoft Fabric. Shabnam will also sponsor the meeting. Your humble correspondent will help you catch up on Microsoft BI latest. For more details and sign up, visit our group page.

PLEASE NOTE A CHANGE TO OUR MEETING POLICY. WE HAVE DISCONTINUED ONLINE MEETINGS VIA TEAMS. THIS GROUP MEETS ONLY IN PERSON. WE WON’T RECORD MEETINGS ANYMORE. THEREFORE, AS DURING THE PRE-PANDEMIC TIMES, PLEASE RSVP AND ATTEND IN PERSON IF YOU ARE INTERESTED IN THIS MEETING.

Presentation: Introducing Lakehouse in Microsoft Fabric

Delivery: Onsite

Date: September 11th

Time: 18:30 – 20:30 ET

Level: Beginner to Intermediate

Food: Sponsor wanted

 

Agenda:

18:15-18:30 Registration and networking

18:30-19:00 Organizer and sponsor time (events, Power BI latest, sponsor marketing)

19:00-20:15 Main presentation

20:15-20:30 Q&A

 

VENUE

Improving Office
11675 Rainwater Dr
Suite #100
Alpharetta, GA 30009

Overview: Join this session to learn about Lakehouse architecture in Microsoft Fabric. Microsoft Fabric is an end-to-end big data analytics platform that offers many capabilities including data integration, data engineering, data science, data lake, data warehouse, and many more, all in one unified SaaS model. In this session, you will learn how to create a lakehouse in Microsoft Fabric, load it with sample data using Notebooks/Pipelines, and work with its built-in SQL Endpoint as well as its default Power BI dataset which uses a brand-new storage mode called Direct Lake.

Speaker: Shabnam Watson is a Business Intelligence consultant, speaker, blogger, and Microsoft Data Platform MVP with 20+ years of experience developing Data Warehouse and Business Intelligence solutions. Her work focus within the Microsoft BI Stack has been on Analysis Services and Power BI and most recently on Azure Synapse Analytics. She has worked across several industries including Supply Chain, Finance, Retail, Insurance, and Health Care. Her areas of interest include Power BI, Analysis Services, Performance Tuning, PowerShell, DevOps, Azure, Natural Language Processing, and AI. She is a regular speaker and volunteer at national and local user groups and conferences. She holds a bachelor’s degree in computer engineering and a master’s degree in computer science.

Sponsor: Shabnam Watson

PowerBILogo

Yet Another Currency Conversion Pattern

Currency conversion is a common requirement, such as when implementing analytics on top of ERP sales data recorded in multiple currencies. The most flexible approach is to allow the user to select a currency and perform the conversion at runtime, ideally by centralizing the conversion code in Power BI calculation groups. However, this approach has several shortcomings:

  1. Based on my experience, the most common requirement is to convert to a single corporate currency, such as USD, but provide the foreign subsidiaries an option to see the data in their local currency, such as for testing that the DW data is accurate. Therefore, the user should be presented with a slicer (or filter) with only two options: USD and Local. In the case of USD, all monetary measures will be converted to USD. In the case of Local, no conversion is needed.
  2. Centralizing the code in a calculation group won’t work with only USD and Local options, because you won’t be able to obtain the context (currency and date) for multiple fact tables and do the lookup.
  3. It incurs some performance hit to look up the exchange rate and check additional conditions.
  4. If you decide to centralize the code in a calculation group, you must filter which measures require conversion.

So, I take the middle path based on the following implementation points:

  1. I import the currency exchange rate that the Finance department uses in the data warehouse. Although the ERP system might store the exchange rate for each transaction, this is typically not the exchange rate Finance uses for the conversion.
  2. In SQL views that wrap the fact tables, I look up the rate based on the currency and data for each row, such as for each invoice line item. Consequently, the rate is readily available when the data is imported in Power BI and no runtime lookup is necessary.
  3. Since I use measure “wrappers” for each measure anyway, the actual conversion is done in the DAX measure, such as:
    GrossSales = 
    VAR _ReportCurrency = SELECTEDVALUE (ReportCurrency[Currency])
    RETURN
    SUMX (Sales, DIVIDE(Sales[GrossSalesBase], IF(_ReportCurrency = "Local", 1, Sales[RateDivisorToUSD])))
    

ReportCurrency could be implemented as a calculation group although it could also be a regular table. I opt for a calculation group in case I need to do something with the format string. Of course, the easiest solution is to format monetary measures as decimals without any currency symbol. The _ReportCurrency variable stores the option selected by the user. In case the user has selected “Local”, the IF expression returns 1 (no conversion is required); otherwise, it gets the exchange rate stored on the record. Finally, the formula divides the based measure (renamed to GrossSalesBase by the exchange rate (in this case, the exchange rate is a divisor) for each row in the table, and then sums up the result.

To recap, I believe this approach is simpler, faster, and more aligned with what you’ll encounter as requirements. On the downside, some code is repeated in each monetary measure.

Improving Data Quality

Poor data quality is one of the most challenging hurdles for successful data analytics. The story typically goes like this. The client has a flexible line of business system that people tend to use and abuse, such as entering garbage in free-form text fields that shouldn’t be free-form to start with. There is no data stewardship or data entry oversight. I’m not blaming anyone because people go about their business trying to get something done as fast as possible and BI is not on their minds. The issue is further complicated by acquisitions that usually bring more systems and more garbage. You and I’ve been there…the bigger and more complex the organization, the bigger the mess.

How do we solve this challenge? Should we buy a super-duper master data management (MDM) system to solve the data quality mess? Or should we kick the can down the road and tackle data quality in downstream systems, such as the data warehouse? In some cases, MDM can help but based on my experience, MDM systems are usually unnecessary. First, solving data quality issues down the road is probably a bad idea. Second, MDM systems bring additional complexity. ETL needs to be implemented to bring the data in and out, a business user must be delegated to correct data issues, another system needs to be learned and maintained, not to mentioned licensing costs. Instead, what’s really needed is to address data quality as upstream as possible, and that is the main data sources, such as the ERP system. The process is simple but tedious. Here are a few steps to get your team started on that road:

  1. Start by cleaning up the hierarchies because they are fundamental for top-level analysis. Request Management to come up with a standardized nomenclature in the form of hierarchies. What do we want our product hierarchy (Product Category, Subcategory,…) to look like? What do we want the Customer hierarchy to look like? Geography? Organization?
  2. Identify major data quality issues that deviate from the standard nomenclature.
  3. For each issue, determine how to address the issue. Can it be fixed by overwriting the data entry or coming up with a new custom field? Can the changes be automated or done in bulk, such as updating the underlying table directly? For example, if the country is misspelled, the best solution will be to go that invoice or customer and change the country. Or, if a certain hierarchy member is completely missing, such as continent or territory, add a new custom field.
  4. Assign the list of the data entry issues identified in step 2 to a data entry person to fix. Or, if they can be fixed in bulk, such as by updating the underlying tables, ask your helpful DBA to make the changes.
  5. If possible, ask the developer in charge to modify the source system to improve data quality, such as by using dropdowns instead of free-form entries for important fields and adding data entry rules.

I know this might be too simplistic. For example, the ERP might disable data entries once an invoice is approved, and I’ve seen clients adding data correction tables in data marts to get around this. If updating the data directly in the underlying tables is an option, this might be your best bet. However, if compliance or other regulations prevent this from happening, then the data must be fixed outside.

However, the main principle remains that data quality must be tackled as close to the source as possible. And improving data quality is your best investment and prerequisite to enable effective data analytics.

Top 5 Lessons Learned from IBM Db2 Integration Projects

I’ve done a few BI integration projects extracting data from ERPs running on IBM Db2. Most of the implementations would use a hybrid architecture where the ERP would be running on an on-prem mainframe while the data was loaded in Microsoft Azure. Here are a few tips if you’re facing this challenge:

  1. IBM provides JDBC, ODBC, and OLE DB drivers. The JDBC driver is not applicable to the Microsoft toolset because none of the Microsoft BI tools run on Java runtime. Microsoft provides an OLE DB Provider for Db2.
  2. If you use Azure Data Factory or Power BI Desktop, you can use the bundled Microsoft OLE DB driver for Db2. Here are the Azure Data Factory settings for setting up a linked service using the Db2 driver:
    1. Server name – specify the server name of IP address of the IBM Db2 server. For DB2 LUW the default port is 50000, for AS400 the default port is 446 or 448 when TLS enabled. Example: erp.servername.com:446 to connect to AS400.
    2. Database name – that one is tricky as I’ve found that even the client doesn’t know it because the ODBC/JDBC driver hides it. The easiest way to obtain it is to connect Power BI Desktop to Db2 using the ODBC driver and look at the Navigator window.
    3. Package collection: This one can get you in trouble. If you get an error “The package corresponding to an SQL statement execution request was not found. SQLSTATE=51002 SQLCODE=-805”, enter NULLID as explained in the ADF documentation (see the link above).
  3. To use the IBM drivers, you need to install them on the on-prem VM hosting the ADF self-hosted runtime. Instead, consider using the Microsoft-provided drivers because they are bundled with the runtime and there is nothing for you to install.
  4. Based on my tests, all drivers have similar performance but do upgrade to the latest IBM drivers if you plan to use them. In one project, I’ve found that the latest IBM ODBC driver outperformed twice an older driver the client had.
  5. The read throughput is going to be mostly limited by the mainframe server itself and the connection bandwidth between the data center and Azure. In all projects, the mainframe read speed was by far slower than reading data from a modest SQL Server. For example, one project showed that it would take 2.5 minutes to read one million rows from the production mainframe, while it would take 40 seconds to read the same data from SQL Server.

     

    While optimizing the mainframe performance will likely be outside your realm of possibilities, you could lobby to increase the connection throughput between the data center and Azure if you plan to transfer a lot of data. Aim for 500 Mbps or higher. For example, after 2.5 minutes to read aforementioned 1 million row dataset from the server, it took another 2.5 minutes to transfer the data (about 2.5 Gb) to Azure because the connection throughput for this project was only 100 Mbps.

Fixing ADF DataFactoryPropertyUpdateNotSupported Deployment Error

This took a while to figure out…so someone might find it useful.

Scenario: You have set up two Azure Data Factory instances for DEV and PROD environment. The DEV environment uses a self-hosted integration runtime to read data from on-prem data sources. The day has come to deploy your masterpiece to PROD. You share the DEV self-hosted runtime with the production ADF instance. You export the ADF ARM template and attempt to run it against the pristine PROD environment. You’re greeted with DataFactoryPropertyUpdateNotSupported error. The “property” here is linked self-hosted runtime.

Solution: The following solution worked for me. In your ARM template, scroll all the way to the bottom and add the following to the typeProperties element of the self-hosted runtime:

IMPORTANT: Make sure that in your production ADF, you have created a shared self-hosted runtime that points to the one in your DEV environment before you proceed with the ARM template deployment. 

{
   "type": "Microsoft.DataFactory/factories/integrationRuntimes",
   "apiVersion": "2018-06-01",
   "name": "[concat(parameters('factoryName'), '/mtx-adf-shir-us')]",
   "dependsOn": [],
   "properties": {
       "type": "SelfHosted",
       "typeProperties": {
           "linkedInfo": {
               "resourceId": "Resource ID obtained when you share the self-hosted runtime in DEV",
               "authorizationType": "Rbac"
           }
       }
   }
}

Atlanta Microsoft BI Group Meeting on August 7th (Introducing Microsoft Fabric)

Atlanta BI fans, please join us for the next meeting on Monday, August 7th, at 6:30 PM ET.  James Serra (Data & AI Solution Architect at Microsoft) will join us remotely to introduce us to Microsoft Fabric. For more details and sign up, visit our group page.

PLEASE NOTE A CHANGE TO OUR MEETING POLICY. WE HAVE DISCONTINUED ONLINE MEETINGS VIA TEAMS. THIS GROUP MEETS ONLY IN PERSON. WE WON’T RECORD MEETINGS ANYMORE. THEREFORE, AS DURING THE PRE-PANDEMIC TIMES, PLEASE RSVP AND ATTEND IN PERSON IF YOU ARE INTERESTED IN THIS MEETING.

 Presentation: Introducing Microsoft Fabric

Delivery: Speaker will join us remotely via Teams

Date: August 7th

Time: 18:30 – 20:30 ET

Level: Beginner

Food: Sponsor wanted

 

Agenda:

18:15-18:30 Registration and networking

18:30-19:00 Organizer and sponsor time (events, Power BI latest, sponsor marketing)

19:00-20:15 Main presentation

20:15-20:30 Q&A

ONSITE

Improving Office
11675 Rainwater Dr
Suite #100
Alpharetta, GA 30009

 

Overview: Microsoft Fabric is the next version of Azure Data Factory, Azure Data Explorer, Azure Synapse Analytics, and Power BI. It brings all these capabilities together into a single unified analytics platform that goes from the data lake to the business user in a SaaS-like environment. Therefore, the vision of Fabric is to be a one-stop shop for all the analytical needs for every enterprise and one platform for everyone from a citizen developer to a data engineer. Fabric will cover the complete spectrum of services including data movement, data lake, data engineering, data integration and data science, observational analytics, and business intelligence. With Fabric, there is no need to stitch together different services from multiple vendors. Instead, the customer enjoys end-to-end, highly integrated, single offering that is easy to understand, onboard, create and operate.

This is a hugely important new product from Microsoft, and I will simplify your understanding of it via a presentation and demo.

Speaker: James Serra works at Microsoft as a big data and data warehousing solution architect where he has been for most of the last nine years.  He is a thought leader in the use and application of Big Data and advanced analytics, including data architectures, such as the modern data warehouse, data lakehouse, data fabric, and data mesh.  Previously he was an independent consultant working as a Data Warehouse/Business Intelligence architect and developer.  He is a prior SQL Server MVP with over 35 years of IT experience.  He is a popular blogger (JamesSerra.com) and speaker, having presented at dozens of major events including SQLBits, PASS Summit, Data Summit and the Enterprise Data World conference.  He is the author of the upcoming book “Deciphering Data Architectures: Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh”.

PowerBILogo

A First Look at Microsoft Fabric: Recap

Did I disappoint you?
Or leave a bad taste in your mouth?
You act like you never had love
And you want me to go without

U2

In previous posts, I shared my initial impression of the recently announced Microsoft Fabric and its main engines. Now that we have the Fabric licensing and pricing, I’m ready to wrap up my review with a few parting notes. Here is how I plan to position Fabric to my clients:

Enterprise clients

These clients have complex data integration needs. More than likely, they are already on a Power BI Premium contract and highly-discounted pricing model that is reviewed and renewed annually with Microsoft. Given that Fabric can be enabled on premium capacities, you should definitely consider it selectively when it makes sense. For now, I believe a good case can be made for data lake and lakehouse if that’s your thing.

Now you have an alternative to Databricks and you can standardize BI on one platform and vendor.

I don’t have experience in Databricks to offer more in-depth comparison, but in my opinion the most compelling features to favor Fabric for now are:

  1. No additional cost or Power BI Premium capacity upgrade if you aren’t reaching the workload limits
  2. One platform and one vendor to avoid the blame game when things don’t work
  3. Fast Direct Lake data access for ad-hoc analysis directly on top of files in the lakehouse
  4. Easy data virtualization

If you decide in Fabric’s favor, you’d be wise to reduce dependencies on Microsoft proprietary and bundled features, such as Power Query dataflows and data pipelines insides Fabric (I’d use a stand-alone ADF instance once ADF supports Fabric). Hopefully, bring-your-own-lake will appear on day to circumvent the Fabric OneLake shortcomings.

Small and medium-size clients

Unfortunately, Microsoft didn’t make Fabric available with PPU (premium-per-user) licensing. This would surely put it out of reach for smaller organizations. True, you can purchase a Fabric F2 license for as little as $262/month and run it on a quarter of a core. I didn’t know a quarter of a core existed, but Microsoft did it, although you probably won’t get too far with it for production use (see results from my F2 limited performance tests here). You can opt for a higher SKU, but it would increase your bill and Fabric capacities can’t be auto-paused. For example, a “luxurious” 1 core (F8 plan) will put you in the 1K/month range, plus Power BI Pro licenses for all users (contributors and viewers).

But fear not. There is nothing in Fabric that you desperately need or can’t obtain outside Fabric in a much more cost-effective way.

Expect Microsoft to push Fabric aggressively. However, I believe Fabric has more appeal for large organizations while low-budged simple solutions with Power BI Pro or PPU licensing would likely better address your needs. And your BI solution is still going to be “modern”, whatever that means…

Fabric Semantic Modeling: The Good, the Bad, and the Ugly

In retrospect, I’d say I owe 50% of my BI career to Analysis Services and its flavors: Multidimensional, Tabular, and later Power BI. This is why I closely follow how this technology evolves. Fast forwarding to Fabric, there are no dramatic changes. Unlike the other two Fabric Engines (Lakehouse and Warehouse), Power BI datasets haven’t embraced the delta lake file format to store its data yet. The most significant change is the introduction of a new Direct Lake data access mode alongside the existing Import and DirectQuery.

The Good

Direct Lake will surely enable interesting scenarios, such as real-time BI on top of streaming data. It requires Parquet delta lake files and therefore it’s available only when connecting to the Lakehouse managed area (Tables folder) and Warehouse tables. Given that Parquet is a columnar format, which is what Tabular VertiPaq is, basically Microsoft has changed the engine to read Parquet files as it does with its proprietary IDF file format.

The primary usage scenario is fast analysis on top of large data volumes without importing data and without delegating the query to another server. Therefore, think of Direct Lake is a hybrid between Import and DirectQuery modes. By “large data volumes”, I mean data that otherwise won’t fit into memory and/or it will require substantial time to refresh but low latency access would be preferable.

Microsoft has accomplished this feat by using the following existing and new Analysis Services features:

  • Vertiscan – The ability for Analysis Services to query columnar storage. Instead of using the IDF file format to store the Vertipaq data, DirectLake instead uses the Parquet file format in Lakehouse or Warehouse. The AS engine loads the data from the Parquet files (with some extra effort) and maps the column values into (mostly) the same data structures that would have been used if the data was coming from IDF files. After that, Vertiscan is querying the data as if it was Import data, so query performance should be at par with Import mode.
  • On-demand data loading – The ability to page in and out data that was introduced in 2021 for imported data. If the data needs to be paged in, there will be some delay but after that it will be fast until and unless it gets paged out later on. Chris Webb covers on-demand loading in his post On-Demand Loading Of Direct Lake Power BI Datasets In Fabric.
  • V-order – an extension to the Parquet file format to get a better compression like VertiPaq

The Bad

Naturally, I’d like to see Direct Lake available outside Fabric.

Currently, here is what needs to happen to connect to external Delta Parquet files, such as files located in ADLS:

  • Create a lakehouse.
  • Create a shortcut in OneLake to the external source table.
  • Create the dataset on top of the lakehouse

As you can see, you can’t escape the Fabric gravitational pull to get Direct Lake. Further, the Parquet files produced by the Fabric workloads (Lakehouse/DW/etc.) will typically be faster and more compressed because of the V-order compression.

The Ugly

Among the Direct Lake limitations, the most significant for me is that not only you need Fabric to get Direct Lake, but also you must create the dataset online using the “New Power BI dataset” feature in Lakehouse/Warehouse, which has its own limitations.

Therefore, for now you can’t use Power BI Desktop to create your semantic model that uses Direct Lake connectivity. This will require Write support to be added to the Analysis Services Power BI XMLA endpoint. However, once you create the Direct Lake dataset, you can use Power BI Desktop to connect to it using the OneLake Data Hub connector.

Power BI Projects and Git Integration: The Good, the Bad, and the Ugly

Is it getting better?
Or do you feel the same?
Will it make it easier on you now?
You got someone to blame

U2

Likely influenced by the Gartner’s Fabric vision, Microsoft Fabric is eclipsing all things Power BI nowadays to the extent of replacing the strong Power BI brand and logo with Fabric’s. Now that the red star Synapse has imploded into a black hole, Fabric has taken its place and it’s engulfing everything in its path.

But to digress from Fabric, let’s take a look at two developer-oriented and frequently requested features that fortunately doesn’t require Fabric: Power BI Desktop projects and workspace integration with Git. The video in the link is a good starting point to understand how these features work.

Basically, the first feature lets you save a Power BI Desktop file as a *.pbip file which generates a set of folders with human-readable text files, such as the model.bim file (has the model definition described in JSON). Next, you can put these files under source control using any source control-enabled tool, such as Visual Studio Code or Visual Studio. Currently in preview, Power BI projects is an experimental feature (in fact, the Publish button is disabled so you can’t publish a project to Power BI Service).


The second feature (workspace integration with Git) lets you configure a workspace to keep Power BI reports and datasets published to the workspace under source control. No other items are currently supported (even paginated reports). This feature is independent of the client-side project mode and it doesn’t require saving Power BI Desktop files as projects.

The Good

Although the primary target of PBI Desktop projects is to enable source control, a more useful side effect to me is that we can now get to and edit the report schema.

The other day someone asked me how you can apply conditional settings of one field in a visual to another. Or, how can you replace the field that has conditional settings applied without loosing them. Well, with some JSON wizardry, you can open the report.json file and make the changes there. At your risk of course because the report schema is ugly and not documented.

I also commend Microsoft for supporting Power BI workspace Git integration in Premium Per User (PPU) licensing to make it more accessible (unfortunately, it’s not available to the masses operating on Power BI Pro). I like the two-way integration in a Git-enabled workspace. It works like Azure Data Factory, so I don’t have to configure source control on the client.

The Bad

The report JSON schema is not documented and as such source code compare is difficult. Parsing a name-value collection in the visual’s config section is required to get to the useful staff.

I hope at some point Microsoft will simplify and document the report schema to let the community take over where Microsoft left off and create tools to compare the report definition.

Naturally, I’d like to see Git integration supporting all workspace items, such as paginated reports, dashboards, and Fabric artifacts.

The Ugly

If you follow my blog, you know that I’m not a big fan of Power BI Desktop for semantic modeling. Instead, I rely on Tabular Editor. When in pbix mode, changes saved in Tabular Editor appear in Power BI Desktop (the reverse is not true). That’s because Tabular Editor connected to a pbix file actually connects to the running instance of the Analysis Services engine and applies changes there. It edits the same in-memory model that PBI Desktop is connected to, thus allowing PBI Desktop to automatically refresh metadata changes.

By contrast, changes made in Tabular Editor in project mode are not reflected in Power BI Desktop.

That’s because changes made in Tabular Editor in project mode are written to the model.bim file and you need to close and open PBI Desktop for these changes to appear. This can get tedious very quickly. My recommendation would be for Power BI Desktop to listen to file changes and automatically refresh the metadata.

Summary

I’ll wait for the project mode feature to mature on the client. Meanwhile, I’ll continue with *.pbix and enable source control integration at the workspace level for clients on PPU or higher licensing. With Power BI Pro licensing, I’ll continue putting the model.bim file from Tabular Editor under source control (no source control for reports).

Fabric Data Integration: The Good, the Bad, and the Ugly

Oops, I did it again
I played with your heart
Got lost in the game…

Britney Spears

In previous posts, I shared my thoughts about Fabric OneLake, Lakehouse, and Data Warehouse. They are of course useless if there is no way to get data in and out of Fabric. Data integration and data quality is usually the most difficult part of implementing a BI solution, accounting for 60-80% of the overall effort. Therefore, this post is about Fabric data integration options.

Fabric supports three options for automated data integration: Data Pipeline (Azure Data Factory pipeline), Dataflow Gen2 (Power BI dataflow), and Notebook (Spark). I summarize these three options in the following table, which loosely resembles the Microsoft comparison table but with my take on it.

Data pipeline
(ADF pipeline/copy activity)
Dataflow Gen2
(Power BI dataflow)
Notebook
(Spark)
Primary userBI developerBusiness analystData scientist, Developer
Patterns supportedETL/ELTETLETL
Primary skillsetSQLPower QuerySpark
Data volumeHighLow to mediumHigh
Primary code languageSQLMScala, Python, Spark SQL, R
ComplexityMediumLowHigh
Vendor lock-inMedium (minimize with ELT pattern)HighLow

The Good

We have three options for data integration to support different personas and skillsets. Unlike other “a notebook with a blinking cursor” vendors, data pipeline and dataflow provide no code/low code options.

Power BI dataflows are now supposedly more scalable, which apparently justifies the Gen2 tag. They finally support destinations although the list is limited to Azure SQL Database and the Fabric engines (Lakehouse, Warehouse, and KQL Database).

The Copy activity in Fabric data pipelines now supports creating delta tables although it doesn’t support merges.

The Bad

Microsoft is pushing dataflows to “data engineer, data integrator, and business analyst”. My guidance is to consider dataflows only if you want to open data ingestion to business users (something you must carefully think about and definitely surround it with a log of supervision). As its predecessor (Power BI dataflows), Power Query is notoriously difficult to troubleshoot or optimize. It doesn’t support the ELT pattern (my favorite), such as to handle Type 2 changes. This could be partially ramified by implementing a pipeline that mixes dataflows with other ADF artifacts, such as calling stored procedures in Fabric Warehouse. Moreover, I consider Power Query as a Microsoft proprietary tool, irrespective that the M language is documented. If one day you decide to leave Fabric, you’d need to rewrite your flows. Finally, the only output options supported are append or replace (no update).

Moving to ADF, the copy activity supports only append or replace (no update). Outside Fabric, Azure Data Fabric doesn’t have connectors to connect to Fabric yet.

I personally abhor the idea to put all BI artifacts in Fabric, if not for anything else, but to have a better way out if one day a client decides to part ways with Fabric.

Haven’t we learned anything from Synapse pipelines? Ask Microsoft how to migrate them to Fabric if you have fallen for that “best practice”. I’d carefully weight going all the way with Fabric (I know that bundles are a big incentive) instead of being more independent and use Fabric more selectively.

For data warehousing, which is the primary scenario I personally care about in Fabric, I primarily rely on the ELT pattern for a variety of reasons. I shall miss T-SQL MERGE in Fabric Warehouse, but I plan to leave it to marinate for a year or so anyway.

The Ugly

After all the push to use ADF mapping data flows in Synapse, where are they hiding in Fabric? Alas, they haven’t made it and they were superseded by Power BI dataflows.

This underscores another important reason to use the ETL pattern whenever you can. At least you can salvage your SQL code as a vendor “evolves” or “revolutionizes” their offerings. Which is another way of a vendor saying “oops, we did it again…” and we shall go back to the drawing board.