Are You Modern Yet?

If you’re a BI architect responsible for designing Azure-based cloud BI solutions and you’re not following the James Serra’s blog, you’re missing a lot. Previously an independent consultant and MVP, James works now as a Big Data/Data Warehouse Evangelist at Microsoft. I can’t agree more with his latest blog “Should I load structured data into my data lake?

We’ve heard a lot lately about what the “modern” data warehouse architecture should look like from Microsoft and other vendors. Here is the diagram taken from, you guessed it, the Microsoft Modern Data Warehouse page.

112918_0247_AreYouModer1.png

The problem is that many organizations take this diagram verbatim when it makes sense and when it doesn’t. For example, a client exports data from on-premises relational databases to text files in Azure Data Lake, only to load it later in a cloud data warehouse. To my observation that this is redundant and pointless, they answered that I shouldn’t deemphasize the “whole Big Data” architecture should one day their users decide to do machine learning. To my observation that should this moment arrive, data scientists will benefit more if the data is readily accessible in a relational database, they said that it contradicts what Microsoft recommended.

Of course, this “modern” architecture makes a lot of sense to the vendor (it’s a nice and perpetual revenue stream) or the consulting partner (the more stuff, the higher the bill) but more than likely is an overkill for you. Here are the most common issues I see organizations making on their BI journey to the cloud:

  1. Excessive data movement – The more you move the data, the more problems. You know it and I know it. Data quality issues, latency, maintenance and endless troubleshooting, etc. In another case, I saw a Cosmos DB sneaked in between, just in case there is too much structure in the source data.
  2. Architecture “gold plating” – A classic example, which the diagram perpetuates, is Azure SQL Data Warehouse. If you have a few million rows of data in your data center, would you buy Teradata? Of course, you wouldn’t! Now, just because you can afford the monthly premium for Azure SQL Data Warehouse, why would you go for an MPP platform? Not to mention that more than likely you’d have to redo your entire ETL if you are migrating from premises as there is still no T-SQL parity. I wouldn’t even consider Azure SQL Data Warehouse for workloads less than 1 TB. And not even then because a semantic model will probably do the heavy lifting. And on ETL, where Microsoft recommends ADF, read my thoughts on it here.
  3. Migrating to the cloud to save cost – No vendor would let its cloud (or whatever) alternative cannibalize another offering. For me, your primary goal for migrating to the cloud should be not cost savings but reliability and convenience. Try to get your IT department to promise SLA or promise anything for that matter. Or provision a VM in minutes? Cost alone, more than likely you’ll spend more moving to the cloud, especially if you go PaaS. For example, a client found that their bill went up astronomically because each BI PaaS offering (Database Engine, Analysis Services, ETL, Power BI) has its own pricing compared to covering all of these with a single SQL Server EE license in their data center. The “modern” architecture above could easily run $50,000/mo which is what a perpetual SQL Server EE license will cost you but over years.

They asked a famous writer one day when he knows that the manuscript is done, and he said when there is nothing more to take out. This principle should apply to BI architectures too. What can you take out to meet the business requirements without overcomplicating? It does cost a lot to be “modern”.

Atlanta MS BI and Power BI Group Meeting on December 3th

MS BI fans, join us for the next Atlanta MS BI and Power BI Group meeting on December 3th, Tuesday at 6:30 PM at the Microsoft office in Alpharetta. I will discuss Power BI dataflows. A3 will sponsor the event. For more details, visit our group page and don’t forget to RSVP (use the RSVP survey on the main page) if you’re planning to attend.

Presentation:Understanding Power BI Dataflows
Date:December 3, 2018, Monday
Time6:30 – 8:30 PM ET
Place:Microsoft Office (Alpharetta)

8000 Avalon Boulevard Suite 900
Alpharetta, GA 30009

Overview:If you’re a business user, you might know about Microsoft Flow. If you’re a BI pro, you might have used the SSIS Control Flow and Data Flow. But do you know about the Power BI dataflow? Data preparation is often the most labor-intensive component of an analytics project and getting this right is vital if the results are to be accurate. While the growth of self-service BI has empowered users to answer important questions, ensuring well managed data is available to all employees remains one of businesses biggest challenges. Requiring no additional cost, dataflows are a building block of the Power BI ecosystem, providing a no-code/low-code approach using Power Query to create curated datasets that can be easily consumed by business analysts. Attend this presentation and you’ll learn:

  • How to stage data with dataflows
  • How dataflows can be used to create curated datasets
  • How dataflows can be created with the click of a button, ingesting data from many well knows SAAS solution, such as Dynamics and Salesforce
  • More advanced integration scenarios with Power BI Premium
Speaker:Through his Atlanta-based company Prologika (https://prologika.com), a Microsoft Gold Partner in Data Analytics, Teo Lachev helps organizations make sense of their most valuable asset: their data. His strategy formulation, trusted advisory and mentoring, design and implementation services empower clients to apply effectively data analytics in order to understand, improve, and transform their business processes. Teo has authored and co-authored several bestselling books on organizational and self-service data analytics, and he has been leading the Atlanta Microsoft BI and Power BI group since he founded it in 2010. Teo has been a Microsoft Most Valued Professional (MVP) Data Platform since 2004.
Sponsor:A3 Solutions is the developer of A3 Modeling. Empower your finance team with the only tool that supercharges end-user Excel into Enterprise Excel. Why incur the risk and cost to replace your Excel models? Just add A3! The San Francisco-based company’s product is in use at some of the world’s largest manufacturers, financial firms and retailers including Honda Manufacturing, Williams Sonoma, T.Rowe Price, and Fox TV, in addition to many mid-sized organizations. http://www.a3solutions.com/
Prototypes with Pizza“New relationship view, paginated reports, and filter pane” by Teo Lachev


092417_1708_AtlantaMSBI1.png

Dataflows: The Good, the Bad, the Ugly

This blog is an update of my previous blog “Common Data Service for Analytics: The Good, the Bad, the Ugly“. Things have changed in the past 6 months. The Common Data Service for Analytics is no more. It got superseded by Power BI dataflows, which Microsoft officially introduced at the PASS Summit last week.

Think of dataflows as “Power Query in the Cloud”. Dataflows is to self-service BI what ETL is to Power BI pros.

A dataflow is a collection of Power Query queries, also known as entities, and it lives in a Power BI organizational workspace. For example, if I want to stage data from Salesforce, I can create a dataflow with as many entities as Salesforce tables I want to stage. Unlike Power Query, which can only output to a table in a data model, a dataflow saves its output as a pair of a CSV file (data) and JSON file (metadata) in Azure Data Lake Store (ADLS).

111718_0237_DataflowsTh1.png

The Good

There is a lot to like about dataflows. I can think of two primary self-service scenarios that can benefit from dataflows:

  • Data staging – Many organizations implement operational data stores (ODS) and staging databases before the data is processed and loaded in a data warehouse. As a business user, you can use data-flows for a similar purpose. For example, one of our clients is a large insurance company that uses Microsoft Dynamics 365 for customer relationship management. Various data analysts create data models from the same CRM data, but they find that refreshing the CRM data is time consuming. Instead, they can create a dataflow to stage some CRM entities before importing them in Power BI Desktop. Even better, you could import the staged CRM data into a single dataset or in an organizational semantic model to multiple data copies and duplicating business logic.
  • Certified datasets – One way to improve data quality and promote better self-service BI is to prepare a set of certified common entities, such as Organization, Product, and Vendor. A data steward can be responsible for designing and managing these entities. Once in place, data analysts can import the certified entities in their data models.

There is a solid architecture, vision, and investments behind dataflows. You create dataflows using a tool that you’re already familiar with: Power Query. If you know Power Query, you know dataflows. Microsoft has provided the data lake storage and pricing is included in Power BI Pro/Power BI Premium. So, you have everything you need to get started with dataflows today. Ingesting the dataflow output is very fast too because you import CSV files.

The Bad

Let’s start with positioning. I’ve heard Microsoft position dataflows for any data integration task, from data staging to loading data warehouses and even replacing data warehouses and ETL (heard that vibe before?) I’ve seen business users doing impressive things with Power BI. But I’ve seen them also attempting to implement organizational solutions that collapse from their own weight. Dataflows are not an exception and I don’t think it’s a business user’s job to tackle ambitious data integration tasks.

You need Power BI Premium to realize the full potential of dataflows. For some obscure reason, features like linked and computed entities are not available in Power BI Pro. I wonder how many customers will feel pushed to go Power BI Premium as more and more features are only available there. I don’t mind scalability and performance related features but incremental refresh and linked entities?

The CSV output is both a blessing and a curse. True, ingestion is fast but text files are not a relational database. For example, the only option is to import the dataflow output in Power BI as DirectQuery is not available for text files.

The Ugly

You (almost) can’t access directly the output generated by dataflows. This precludes the continuum between self-service and organizational worlds. For example, IT might need to import a certified dataset into a data warehouse. But you can only connect to a dataflow in Power BI Desktop because the CSV files are not accessible. Power BI Premium will let you bring your own data lake storage in near future but a better option in my opinion was to provide access to the raw data in both Power BI Pro and Premium given that this is your data lake and your data.