Organizational Semantic Model

I am delivering a data governance assessment for an enterprise client. As part of the effort to migrate reporting from MicroStrategy to Power BI, the client wants to improve its data analytics. The gap analysis interviews with the business leaders revealed common pitfalls: no single version of truth, data is hard to come by, business users don’t know what data sources exist, business users spend more time on data wrangling than on analytics, data quality is bad, IT is overwhelmed with report requests, report proliferation and duplication, and so on…

Sound familiar? As I’ve mentioned many times on my blog, an enterprise data warehouse (EDW) plays a critical role in overcoming the above challenges, but it’s not enough. A semantic model is also needed, and I extolled its virtues in my “Why Semantic Layer?” newsletter. In the Microsoft BI world, Analysis Services Tabular is commonly used to implement such models, which are typically layered on top of the EDW. In general, there are two ways to approach the model implementation:

  1. (Self-service BI path) Business users create self-service semantic models using Power BI Desktop. Behind the scenes, Power BI creates databases hosted in the Analysis Services Tabular server from the *.pbix files.
  2. (Organizational BI path) BI developers implement organizational semantic models.

Since both implementation paths lead to the same technology, it boils down to ownership, vision, and purpose.

Because IT is overwhelmed, the temptation is to transfer the semantic model development to business users. The issue with this approach is that business users seldom have the skills, time, and vision to do so. In the end, mini “semantic models” (“spreadmarts”) are produced and the same problems are perpetuated.

In most cases, my recommendation is for IT to own the semantic model because they own the data warehouse and the “Discipline at the Core” vision. And yes, unless operational and security requirements dictate otherwise, IT should strive for a single centralized semantic model that spans all subject areas. If the technology you use for semantic modeling can’t deliver acceptable performance with large models, then it’s time to change it.

Of course, not all data exists or will exist in EDW. This is where self-service “Flexibility at the Edge” comes in. I have high expectations for the forthcoming “Composite models over Power BI datasets and Azure Analysis Services” mega feature (public preview expected in November 2020). This will enable the following scenario that Power BI cannot deliver today:

  1. Business user starts by connecting live to corporate data in the organizational semantic model. Every new report requirement should start with evaluating whether all or some of the data is already in the semantic model and, if so, instructing the user to connect to it to avoid data modeling and data duplication.
  2. Business user wants to mash up this data with some data that is not in EDW. They do so by retaining the live connection to the organizational semantic model and importing (or connecting live with DirectQuery to) other datasets.

This will give us the best of both worlds and help us achieve “Discipline at the Core, Flexibility at the Edge”.

Section Hiked A.T. in Georgia

My wife and I started section hiking the Appalachian Trail on weekends to escape the summer heat and the virus. The A.T. runs for 2,200 miles from Georgia to Maine, with 78.6 miles in Georgia. Today we finished the Georgia part and entered North Carolina. We actually covered twice the distance (averaging 10-12 miles per section) because we had to come back each time to where we parked. We started hiking with the great Atlanta Outdoor Club back in February 2020. But when the virus hit, group hikes were put on hold, so we were left to our own devices.

Hiking somehow grew on me. Perhaps because it is a metaphor for life. There are ups and downs. Some sections are hard and require a great deal of effort and perspiration, while others are easy. There are exhilarating views, but there are also areas with overgrown vegetation. Perceived risks, such as wild animals (note the bear spray on my belt), are lurking in the distance. Some people tell you that you’re crazy. The weather is unpredictable, and planning is sometimes futile (we got ourselves into a really bad thunderstorm and downpour once when the weather was supposed to be OK). But for the most part, it is just putting one foot in front of the other. It’s about the journey, not the destination. And the more you put in, the more you’ll get out.

Kudos to the wonderful people who maintain the A.T. and to those who supported us with advice! Kudos to the brave thru-hikers who cover the whole distance. I’m really jealous… one day maybe.

Azure Synapse: The Good, The Bad, and The Ugly

Cloud deployments are the norm nowadays for new software projects, including BI, and Azure Synapse shows great potential for modern cloud-based data analytics. Here are some high-level pros and cons for implementing Azure Synapse-centered solutions, harvested from my real-life projects and workshops.

The Good

There is plenty to like in Azure Synapse, which is the evolution of Azure SQL DW. If you’re tasked with implementing a cloud-based data warehouse, you have a choice among three SQL Server-based Azure PaaS offerings: Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse. In a nutshell, Azure SQL Database and Azure SQL MI are optimized for OLTP workloads. For example, they have full logging enabled and replicate each transaction across replicas. Full logging is usually a no-no for decent-size DW workloads because of the massive ETL changes involved.

In addition, to achieve good performance, you’ll find yourself moving up the performance tiers and toward the price point of the lower Azure Synapse SKUs. Further, unlike Azure SQL Database, Azure Synapse can be paused, such as when reports hit a semantic layer instead of the DW, and this may offer additional cost-cutting options.

At the heart of Azure Synapse is the SQL pool (previously known as Azure SQL DW), which hosts your DW. As an MPP system, it can scale to petabytes of data with proper sizing and good design. The obvious benefit is that for the most part (see the Ugly section discussing exclusions) you can carry your SQL Server skills over to Azure Synapse.
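
To make “good design” a bit more concrete, here is a minimal DDL sketch of the two most common distribution choices: hash-distribute the large fact table and replicate the small dimension table. The table and column names are made up for illustration.

-- Hypothetical fact table: hash-distributed so rows are spread across the 60 distributions
CREATE TABLE dbo.FactOrders
(
    OrderKey    BIGINT NOT NULL,
    DateKey     INT NOT NULL,
    SalesAmount DECIMAL(19, 4) NOT NULL
)
WITH (DISTRIBUTION = HASH(OrderKey), CLUSTERED COLUMNSTORE INDEX);

-- Hypothetical dimension table: replicated to every compute node to avoid data movement during joins
CREATE TABLE dbo.DimDate
(
    DateKey      INT NOT NULL,
    CalendarDate DATE NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);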

As a general rule of thumb, Azure Synapse should be at the forefront of your modern cloud DW considerations for data volumes starting at around 500 GB.

I liked the integration with Azure Data Lake Storage (ADLS). Previously, you had to set up external tables to access files in ADLS via PolyBase. This is now unnecessary thanks to the serverless (on-demand) endpoint, which also replaces the Gen1 U-SQL with T-SQL. All you have to do is link your ADLS storage accounts and use the T-SQL OPENROWSET construct to query your ADLS files and expose them as SQL datasets. Not only does this open new opportunities for ad hoc reporting directly on top of the data lake, but it can also save you tons of ETL effort by avoiding staging the ADLS data while loading your DW.
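
For illustration, a query against the serverless endpoint can be as simple as the following sketch (the storage account, container, and folder names are hypothetical):

-- Expose the parquet files in the lake folder as a SQL dataset
SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/datalake/orders/*.parquet',
        FORMAT = 'PARQUET'
    ) AS orders;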

As the diagram shows, you can also provision an Azure Spark runtime under a Synapse workspace. This opens the possibility of using Azure Spark/Databricks for data transformation and data science. For example, a data scientist can create notebooks in Synapse Studio that use a programming language of their choice (SQL, Python, .NET, Java, Scala, and R are currently supported).

The Bad

You have two coding options with Synapse.

  1. You can use the respective development environment, such as Azure Data Factory Studio for ADF development and Power BI Desktop for Power BI report development.
  2. You can use Synapse Studio which attempts to centralize coding in one place.

Besides the ADLS integration and notebooks, I find no advantages in the Synapse Studio “unified experience”, which is currently in preview. In fact, I found it confusing and limiting. The idea here is to offer an online development environment for coding and accessing in one place all artifacts surrounding your DW project. Currently, these are limited to T-SQL scripts, notebooks, Azure Data Factory pipelines, and Power BI datasets and reports.

However, I see no good reason to abandon the ADF Studio and do ETL coding in Synapse Studio. In fact, I can only see reasons against doing so. For example, Synapse Studio lags in ADF features, such as source code integration and wrangling workflows. For some obscure reason, Synapse Studio separates data flows from ADF pipelines into its own Data Flows section of the Development Hub. This is like putting the SSIS data flow outside the package. Not to mention that inline data flows should be deemphasized since ELT (not ETL) is the preferred pattern for loading Azure Synapse. To make things worse, once you decide to embrace Synapse Studio for developing ADF artifacts, you can use only Synapse Studio because the ADF pipelines are not available outside Synapse.
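
To illustrate what I mean by ELT, here is a rough sketch (the storage URL and table names are hypothetical, and it assumes a staging table stg.Orders already exists with matching columns) where the heavy lifting happens inside the SQL pool after the files land in the data lake, instead of transforming rows in a data flow:

-- Bulk-load the staged files from the lake into a staging table
COPY INTO stg.Orders
FROM 'https://mystorageaccount.blob.core.windows.net/datalake/orders/2020/*.parquet'
WITH (FILE_TYPE = 'PARQUET', CREDENTIAL = (IDENTITY = 'Managed Identity'));

-- Transform with set-based T-SQL inside the SQL pool
INSERT INTO dbo.FactOrders (OrderKey, DateKey, SalesAmount)
SELECT OrderKey, DateKey, SalesAmount
FROM stg.Orders;

In this pattern, the ADF (or Synapse) pipeline only orchestrates these statements rather than moving and transforming rows itself.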

On the Power BI side of things, you can create a Power BI dataset in Synapse Studio. All this saves is the effort to go through two Get Data steps in Power BI Desktop to connect to your Azure Synapse endpoint. You can also create Power BI reports using Power BI Embedded in Azure Synapse. Why I would do this outside Power BI Desktop is beyond me. Perhaps I’m missing the big picture and more features might be coming as Synapse evolves. Speaking of evolution, as of this time, only SQL pools are generally available. Everything else is in preview.

The Ugly

Developers moving from on-prem BI implementations to Azure Synapse would naturally expect all SQL Server features to be supported, just like moving from on-prem SSAS to Azure AS. However, they will quickly find that Azure Synapse is a “one of a kind” SQL Server. The nice user interface in SSMS for carrying out admin tasks, such as creating logins and assigning logins to roles, doesn’t work with Azure Synapse (all changes require T-SQL). SQL Profiler doesn’t work either (tip: use the Profiler extension in Azure Data Studio for tracing queries in real time).
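
For example, granting a report developer read access, which is a couple of clicks in SSMS, becomes something along these lines (the login name and password are placeholders):

-- In the master database
CREATE LOGIN ReportDev WITH PASSWORD = '<strong password>';

-- In the SQL pool (user) database
CREATE USER ReportDev FOR LOGIN ReportDev;
EXEC sp_addrolemember 'db_datareader', 'ReportDev';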

Because Azure Synapse uses Azure SQL Database, it’s limited to one physical database, so you must rely on schemas to separate your staging and DW tables. There are also T-SQL limitations. For example, my favorite MERGE statement for ELT-based UPSERTs is not yet supported (it’s currently in preview for hash- and round-robin-distributed tables, but not for replicated tables, such as dimension tables). So, don’t assume that you can seamlessly migrate your on-prem DW databases to Synapse. Also, don’t assume that your SQL Server knowledge will suffice. Prepare to seriously study Azure Synapse!
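
Until MERGE arrives, one possible workaround is to rebuild the target table with CTAS and swap it in. A minimal sketch with hypothetical table names (reusing the stg.Orders and dbo.FactOrders tables from the earlier examples):

-- New and changed rows come from staging; unchanged rows are carried over from the current fact table
CREATE TABLE dbo.FactOrders_Upsert
WITH (DISTRIBUTION = HASH(OrderKey), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT s.OrderKey, s.DateKey, s.SalesAmount
FROM stg.Orders AS s
UNION ALL
SELECT f.OrderKey, f.DateKey, f.SalesAmount
FROM dbo.FactOrders AS f
WHERE NOT EXISTS (SELECT 1 FROM stg.Orders AS s WHERE s.OrderKey = f.OrderKey);

-- Swap the rebuilt table in place of the old one
RENAME OBJECT dbo.FactOrders TO FactOrders_Old;
RENAME OBJECT dbo.FactOrders_Upsert TO FactOrders;
DROP TABLE dbo.FactOrders_Old;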

Uploading Files to ADLS Gen2 with Python and Service Principal Authentication

I had an integration challenge recently. I set up Azure Data Lake Storage for a client, and one of their customers wants to use Python to automate the file upload from macOS (yep, it must be a Mac). They found the azcopy command-line tool not automatable enough, so I whipped out the following Python code. I configured service principal authentication to restrict access to a specific blob container instead of using Shared Access Policies, which require PowerShell configuration with Gen 2. The comments below should be sufficient to understand the code.

### install dependencies
# install Azure CLI https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest
# pip install azure-identity
# pip install azure-storage-blob
# upgrade or install pywin32 to build 282 to avoid the error "DLL load failed: %1 is not a valid Win32 application" while importing azure.identity
# pip install --upgrade pywin32

# IMPORTANT! Set the four environment (bash) variables as per https://docs.microsoft.com/en-us/azure/developer/python/configure-local-development-environment?tabs=cmd
# Note that AZURE_SUBSCRIPTION_ID is enclosed in double quotes while the rest are not

import os

from azure.storage.blob import BlobClient
from azure.identity import DefaultAzureCredential

storage_url = "https://mmadls01.blob.core.windows.net"  # mmadls01 is the storage account name

credential = DefaultAzureCredential()  # looks up the environment variables to determine the auth mechanism; in this case, service principal authentication

# Create the client object using the storage URL and the credential
# "maintenance" is the container; "in" is a folder in that container, so it becomes part of the blob name
blob_client = BlobClient(storage_url, container_name="maintenance", blob_name="in/sample-blob.txt", credential=credential)

# Open a local file and upload its contents to Blob Storage
with open("./sample-source.txt", "rb") as data:
    blob_client.upload_blob(data)


Presenting Analytics in a Day Workshop on August 20th

In partnership with Microsoft, I’m delivering a complimentary, one-day, Analytics in a Day virtual workshop on August 20th, 9 AM – 5 PM Eastern Time. Targeting BI developers, architects and technology decision makers interested in achieving a single version of truth with organizational BI, this workshop is designed to guide and accelerate your journey towards a modern data warehouse to power your business with Azure Synapse, Azure Data Factory, Azure Data Lake, and Power BI.

The first half of the day from 9 am – 1 pm will help you better understand how to:

  • Create an analytics solution that goes from data ingestion to insights using Azure Synapse Analytics and Power BI
  • Empower self-service analytics
  • Enable a truly data-driven culture in your business

Part of the workshop will be dedicated to hands-on training to help you get started on your cloud analytics journey.

The second half of the day from 1 pm – 5 pm is optional and will focus on best practices around designing Power BI + Azure Synapse Analytics BI solutions to enable “Discipline at the core and flexibility at the edge“.

Say goodbye to data silos. Analytics in a Day is designed to simplify and accelerate your journey towards a modern data warehouse to power your business. Reserve your seat today at http://bit.ly/aid202008.

Discipline at the Core, Flexibility at the Edge

I’m preparing to teach the brand new Analytics in a Day course by Microsoft. This course emphasizes the business value and technical fundamentals of implementing a modern cloud DW using Azure Synapse, ADF, Data Lake, and Power BI. The second half of the class is focused on Power BI and its role in creating organizational semantic models and self-service models from Synapse. I liked the best practices that Microsoft shares based on how they’ve adopted BI over the years and the challenges they faced with self-service BI, including:

  • Inconsistent data definitions, hierarchies, metrics, KPIs
  • Analysts spending 75% of their time collecting and compiling data
  • 78% of reports being created in “offline environments”
  • Over 350 centralized finance tools and systems
  • Approximately $30M annual spend on “shadow applications”

Indeed, many vendors tout only self-service BI which can quickly lead to chaos. By contrast, I have found that most successful data-driven organizations have both organizational and self-service BI.

Microsoft calls the collection of these best practices “Discipline at the Core, Flexibility at the Edge“.

“Discipline at the Core” is organizational BI where IT retains control to:

  • Deliver standardized and performant corporate BI (single version of truth)
  • Define consistent taxonomies, hierarchies, and KPIs
  • Enforce data permissions centrally.

“Flexibility at the Edge” is self-service BI where data analysts:

  • Quickly create reports sourced from trusted data
  • Mashup core data with departmental data
  • Create new metrics and KPIs relevant to their part of the business

I further recommend the 80/20 rule as a rough guideline, where most of the effort (as much as 80%) goes into integrating and curating the data, implementing enterprise models, and enforcing security, and 20% is left for the agility of self-service BI.

Is your organization disciplined at the core, or is the core missing as a result of blind adherence to the self-service BI mantra?

Atlanta MS BI and Power BI Group Meeting on August 3rd

Our group celebrates its 10th anniversary! Please join us online for the next Atlanta MS BI and Power BI Group meeting on Monday, August 3rd, at 6:30 PM. Chris Hamill from the Power BI CAT team will share techniques for creating performant reports without sacrificing design. For more details, visit our group page and don’t forget to RSVP (fill in the RSVP survey if you plan to attend).

Presentation: Power BI Report Design Techniques for Performance
Date: August 3rd, 2020
Time: 6:30 – 8:30 PM ET
Place: Join Microsoft Teams Meeting


Overview: As you have likely observed, a performant model with optimized DAX can suffer greatly if the front-end design is too heavy. Front-end report developers are often challenged with balancing performance and the richness of user requirements. Chris Hamill from the Power BI CAT team will share techniques for creating performant reports without sacrificing design.
Speaker: Chris is a Sr. Program Manager on Microsoft’s Power BI CAT team with a primary focus on front-end reporting and user experience. In his day job he spends time working with enterprise customers to help skill up report authors and guide design for high-impact reporting. Check out his blog at www.alluringbi.com for tips and tricks for the Power BI creator.
Prototypes without pizza: Power BI latest features


Why You Need a Trusted Advisor

I’ve been providing advisory services to a Fortune 500 organization for a few months now. Like all large organizations, they adopted Power BI Premium. However, they have provisioned only one Power BI Premium P1 node, which has been showing signs of overutilization. In the process, I discovered that they had purchased 40 Power BI Premium cores, with 32 cores left unutilized! In other words, they were using one fifth of what they’ve been paying Microsoft in Power BI Premium fees. How did they arrive at this unfortunate situation?

A year or so ago, they used the Power BI Premium Calculator to estimate the licensing cost on their own. They plugged in 10,000 users and got a recommendation for 5 P1 nodes (or 40 cores). And that’s what they bought, assuming they would get a cluster of five P1 nodes that would load balance the reports across nodes. When they set up Power BI Premium, they provisioned only one P1 capacity and all the important reports went in there. And these reports have been running just fine for a long time with thousands of users… The other 32 cores? Sitting ducks.

Power BI Premium doesn’t currently load balance across nodes. Once you’ve licensed a certain number of cores, it does give you the flexibility to provision nodes of different sizes. So, with 40 licensed cores, you can set up 5 P1 nodes, or 2 P2 nodes and 1 P1 node, or 1 P3 node and 1 P1 node. Unfortunately, the flexibility ends there. Once you provision the capacity, the meter is running whether you use it or not. And if you’re not using all these licensed cores, you still pay unless you manually scale down the capacity. I hope to see Power BI Premium Serverless sometime in the future, where the service scales up and down on demand and you pay for what you actually consume. This would remove the guesswork about how much capacity and how many nodes you need, probably save you lots of money, and make Power BI Premium more affordable.

The moral of this story? Hire a trusted advisor and technology expert, and save a lot!

Conquering Time Zones

A client has Power BI models connected to Dynamics Online. Dynamics stores all dates in UTC instead of keeping the time offset, such as 7/14/2020 1:21:29 AM +00:00. Naturally, the users want to see dates localized to the US Eastern Time zone. Easy, right? Use the Power Query To Local time transformation (in the Transform ribbon, expand Time, and then click To Local) to offset the time by the desired number of hours. But there are a few issues with this approach:

  1. It will work fine in Power BI Desktop (assuming you are in the Eastern Time zone, because Power BI will pick up your computer settings), but it won’t work when you publish the dataset to powerbi.com. The reason for this is that Microsoft sets the time zone of Azure servers to UTC irrespective of the geo location of the Power BI data center. So, when Power BI Service refreshes the dataset, it will convert dates to its local time (UTC) and the net effect is that the date remains unchanged.
  2. You can attempt using the M DateTimeZone.SwitchZone function but for some obscure reason, casting the column data type to Date after the zone change puts the date back to UTC.
  3. The hour offset changes depending on when the country switches to Daylight Saving Time.

When the going gets tough, the developer writes some code, which in this case would be a Power Query ToTimeZone query function.

// Open a blank query in the Power Query Advanced Editor and paste the code. Then rename the query to ToTimeZone
let
 Source = (UTCTime, TimeOffset) =>
let
 CurrentYear = Date.Year(DateTime.FixedLocalNow()),
 // DST starts at 2 AM on the second Sunday of March (the Sunday on or before March 14)
 DST_Start = Date.StartOfWeek(#datetimezone(CurrentYear,3,14,0,0,0,TimeOffset,0),Day.Sunday) + #duration(0,2,0,0),
 // DST ends at 2 AM on the first Sunday of November (the Sunday on or before November 7)
 DST_End = Date.StartOfWeek(#datetimezone(CurrentYear,11,7,0,0,0,TimeOffset+1,0),Day.Sunday) + #duration(0,2,0,0),
 // During DST, move the offset one hour closer to UTC
 OffsetTime = if UTCTime >= DST_Start and UTCTime < DST_End then TimeOffset + 1 else TimeOffset,
 // Guard against null dates, strip the zone, and apply the offset
 #"Change Time" = if UTCTime = null then null else DateTimeZone.RemoveZone(UTCTime) + #duration(0, OffsetTime, 0, 0)
in
 #"Change Time"
in
 Source

The function takes two arguments: UTCTime and TimeOffset. For each row, the source query will pass the date as a datetimezone data type. TimeOffset is the default time offset outside Daylight Saving Time, so for the US Eastern Time zone the offset is -5. Notice that I don’t specify types for the arguments because you may have empty (null) dates; if null were passed to a typed UTCTime, you’d get a runtime error. Instead, the last line checks if the incoming date is null. The function adjusts the time offset during Daylight Saving Time. Finally, it removes the time zone from the original date, so we get the date in a datetime format, and then adds the offset. So, 7/14/2020 1:21:29 AM +00:00 will become 7/13/2020 9:21:29 PM.

Once the ToTimeZone function is ready, we can call it from the source query, like so:

let
 Source = OData.Feed("https://<tenant>.crm.dynamics.com/api/data/v9.1/appointments", null, [Implementation="2.0"]),
 ToLocalTimeZone = Table.TransformColumns(Source, {{"createdon", each ToTimeZone(_, -5)}, {"scheduledstart", each ToTimeZone(_, -5)}}),
 #"Changed Type" = Table.TransformColumnTypes(ToLocalTimeZone,{{"createdon", type date}})
in
 #"Changed Type"

Notice that I use the TransformColumns transformation so that I can convert multiple date columns in one line and don’t end up with additional custom columns. In other words, I’m updating the date columns in place. Finally, I change the data type to date so that I can join to a Date dimension.

You can easily extend this code to handle times in multiple time zones, such as to report the time depending on the geo location of the user who entered the data. This will require storing the user time zone, such as in the Dynamics systemusers table.

Designing Responsive Power BI Reports (Part 2)

In my previous post about responsive reports, I said that changes to the report design can dramatically reduce the report load time. But you will undoubtedly reach a point where you cannot optimize the report any further. After all, having a page with a single visual might be fast, but I doubt it would deliver much business value. What else can be done to speed up the reports and ideally achieve a page load time of a few seconds? Our next stop for that project was to revisit the solution architecture. The client initially opted for a hybrid architecture where the reports use a live connection to on-prem SSAS Tabular models. However, a significant amount of time was spent in Power BI showing “loading data” before rendering the report. What took so long?

In this case, Power BI was hosted in the Azure West region, the Tabular server in the company data center in Atlanta, and most users were in the East region. This resulted in round trips crossing the country for each visual.

  1. A user in the East region requests a report which is hosted in the West region.
  2. Each visual generates a query which is sent to Tabular in the company data center.
  3. SSAS returns data back to the West region.
  4. Report payload is sent to the user in the East region.

Ideally, users, reports, and data should be in the same region. However, a long-standing limitation of Power BI is that once the first user signs up for Power BI, Power BI determines the data region based on that user’s location, and you can’t move it unless you open a support ticket with Microsoft. Power BI Premium will let you create premium capacities in a different region, but by default these capacities will be created in the original Power BI region, and you can’t change the region once the capacity is created either. The chart below shows that the round-trip latency doubles if the user is in the East region but Power BI is in the West.

We decided to move the semantic model to Power BI so that Power BI owns the data. Besides potentially improving the report load time, this architecture has other important advantages (to learn more, read my “Power BI Large Datasets: The Good, the Bad, and the Ugly” post). If you’re not on Power BI Premium, that “movement” might not be easy if you have opted to use Visual Studio or Tabular Editor for development. That’s because Power BI Pro doesn’t expose the XMLA endpoint, so your only option is to migrate the model to Power BI Desktop. But migrating an SSAS Tabular project to Power BI Desktop is not officially supported and there is no automatic migration path.

Thanks to the enhanced dataset metadata (a better name would have been “metadata done right this time”), you could opt for cracking open the Power BI Desktop file and replacing the model schema (my “Power BI Source Control” blog has the steps to get to the schema). You might get lucky, especially if your Tabular project uses structured data sources (the only data source type supported by Power BI Desktop). However, if it uses legacy data sources, you must change every table to use a structured data source, and making changes to the JSON schema by hand, with Power BI Desktop refusing to load the model at every typo, is no fun.

On the other hand, Power BI Premium lets you deploy the SSAS model to the workspace XMLA endpoint. The prerequisites are:

  1. Upgrade the Tabular model in Visual Studio to 1500 compatibility mode. This adds special annotations to the data source. If you don’t upgrade to 1500, you’ll find that you won’t be able to schedule the dataset for refresh as Power BI Service won’t show any connections.
  2. Enable the XMLA endpoint for read-write in the capacity settings.
  3. Enable the “Export data” tenant setting in Power BI (to learn more about this setting, read my “A False Sense of Security” blog).
  4. As if the “Export data” battle wasn’t enough, we found that we also had to beg the Security group to enable the “Analyze in Excel with on-premise datasets” tenant setting. I have no idea what this setting has to do with the XMLA endpoint. Luckily, Fiddler showed us the exact error message stating that this setting must be enabled.
  5. In Visual Studio, change the model processing option to “Do not process”. The “Troubleshoot XMLA endpoint connectivity” article has the details.

Moving the semantic model paid big dividends. We could see “loading data” for only a moment. We also found that the network speed of the user’s connection and the browser play a significant role. For best performance, users should use Chrome or the new Edge and have a broadband connection. We were able to further reduce the load time of the first page by enabling query caching for the published dataset, which is only available in Power BI Premium.