Datasets vs Datamarts
I sent a proposal for implementing a classic BI solution: an Azure SQL-based datamart (not a Power BI datamart, please), ETL, semantic model, and reports. The client had sticker shock. Return to sender … the other BI companies that quoted could do it for half! Upon digging, it turned out the other companies would build the semantic model (aka Power BI dataset) directly on top of the data source. On a time-and-materials (T&M) basis, charging by the hour, of course, what else? By contrast, I give fixed-price, milestone-driven proposals and I don’t get paid unless I deliver and meet written and agreed-upon success criteria, but that’s a different story.
So, let me count the ways, as the poet would say. It’s certainly technically possible to slap a dataset on top of the data source(s). That’s what self-service BI is all about, right … until it doesn’t serve anymore. Check Microsoft’s “discipline at the core” story about how that journey ended. But BI pros can do it better and more efficiently and still bypass building the datamart, right? Here is why the “shortcut” will probably not work so well:
- Architecture – If you don’t have a datamart, you’re betting it all on Power BI. But tools come and go, and by no means would I put all my eggs in a single basket, regardless of the respect I have for Power BI (which is far from ideal, of course). By contrast, if one day you decide to switch to another tool and you have invested in a datamart, you only have to replace the semantic model and reports. The data staging, transformations, and improved data quality stay on.
- ETL – We all agree by now that the star schema is our best friend. It’s certainly possible to use Power Query to shape the data any way you want it. But Power Query can be notoriously slow and difficult to troubleshoot, and I’ve seen companies get into a lot of trouble when tilting too much toward it. Also, what ETL assumptions are you making and what limitations are you betting against when using Power Query? No advanced transforms and no advanced requirements, such as Type 2 changes (see the sketch after this list)? No incremental data loads? No restartability? No decent monitoring and troubleshooting?
- Data integration – Important data should be consolidated and centralized into a repository. And that repository should be a relational database. Also, what about making the data available to other tools? Should we lock them into using the Power BI XMLA endpoint?
- Semantic model – What if you have to support data refreshes at different granularity, such as hourly vs daily, or import vs direct query? What if one Power Query query fails to refresh? Should the entire refresh fail?
- Data volumes – Although your initial dataset might be less than a million rows, what if that changes, or the customer decides to use larger external data? Can Power Query handle this?
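To make the Type 2 point concrete, here is a minimal T-SQL sketch of the kind of slowly changing dimension handling a datamart ETL process routinely does and that Power Query has no good answer for. The schema (a stg.Customer staging table, a dbo.DimCustomer dimension with an IDENTITY surrogate key, and the tracked columns) is hypothetical and only illustrates the pattern.

```sql
-- Hypothetical objects: stg.Customer holds the latest source extract;
-- dbo.DimCustomer is a Type 2 dimension with RowStartDate/RowEndDate/IsCurrent.

-- Step 1: expire the current row for customers whose tracked attributes changed
UPDATE d
SET    d.RowEndDate = GETDATE(),
       d.IsCurrent  = 0
FROM   dbo.DimCustomer AS d
JOIN   stg.Customer    AS s ON s.CustomerID = d.CustomerID
WHERE  d.IsCurrent = 1
  AND (s.City <> d.City OR s.Segment <> d.Segment);

-- Step 2: insert a new current row for changed customers (expired above)
-- and for brand-new customers that have no current row yet
INSERT dbo.DimCustomer (CustomerID, CustomerName, City, Segment, RowStartDate, RowEndDate, IsCurrent)
SELECT s.CustomerID, s.CustomerName, s.City, s.Segment, GETDATE(), NULL, 1
FROM   stg.Customer AS s
LEFT JOIN dbo.DimCustomer AS d
       ON d.CustomerID = s.CustomerID AND d.IsCurrent = 1
WHERE  d.CustomerID IS NULL;
```

A production version would also handle deletes, NULL-safe comparisons, restartability, and logging, which is exactly the plumbing the “shortcut” quietly skips.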
Shortcuts are tempting and disguise themselves as cost-effective. If you’re a data analyst who doesn’t know better or can’t afford any better, by all means take the data source -> dataset approach. At least, if you’re sourcing data from a relational database, insist on SQL views so you can offload some transformations upstream (one is sketched below). If you’re a BI pro and you’re building a pilot, go ahead. But if you’re building an organizational BI solution that must adapt, evolve, and endure, at least let your sponsor know what assumptions and tradeoffs you’re making along the way. Let’s be honest.
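To illustrate the SQL view advice, here is a hypothetical example of a view that pushes renaming, type casting, and basic filtering into the source database, so the dataset imports analysis-ready columns instead of repeating that work in Power Query. All object and column names are made up.

```sql
-- Hypothetical view in the source database; the Power BI dataset imports from
-- bi.vSales instead of the raw table, so column renaming, type casting, and
-- basic filtering happen upstream in SQL rather than in Power Query.
CREATE VIEW bi.vSales
AS
SELECT
    d.SalesOrderID,
    CAST(d.OrderDate AS date)                   AS OrderDate,
    d.CustomerID,
    d.ProductID,
    CAST(d.Qty AS int)                          AS OrderQuantity,
    CAST(d.UnitPrice * d.Qty AS decimal(19, 4)) AS SalesAmount
FROM dbo.SalesOrderDetail AS d
WHERE d.IsCancelled = 0;
```

If the source schema later changes, only the view needs to change, and the dataset keeps importing the same columns.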