PROLOGIKA NEWSLETTER SUMMER 2023

Enterprise Resource Planning (ERP) systems capture a wealth of information from main business processes within an organization, such as Sales, Inventory, Orders, etc. Therefore, a BI journey typically starts with analyzing the ERP data, with revenue analytics usually taking the center stage because of its vital metrics, such Revenue, Margin, Profit. In fact, this scenario has repeated so often for the past few years that Prologika has developed a fast-track implementation for ERP analytics. The client presented in this case study had more complicated integration needs, but other projects for both larger and smaller clients in the same or other industries have benefited from the same methodology.

Business Needs

The client manufactures and distributes electrical components. The company has grown rapidly through acquisitions, with every business unit having its own ERP system, including Dynamics and other business-specific ERPs. Some ERPs were installed on premises, such as Dynamics that stores data in a SQL Server database on the company’s data center, and others were cloud systems.

The client realized that it’s difficult to analyze performance across business units and disperse ERP systems. They envisioned a centralized and fully automated reporting solution that will consolidate data from all ERPs into a single repository so they can track sales across business units, sales people, vendors, and customers. In addition, they wanted their developers to enhance and maintain the solution, so simplicity and knowledge transfer were paramount.

Solution

After assessing the current state and objectives, Prologika recommended and implemented the following cloud architecture:

Prologika adopted an iterative approach to integrate the data from all five ERP systems where each ERP had its own iteration. Laying out the foundation, the first iteration was the most important and had the following main deliverables:

  1. Schema-first design – Prologika analyzed the requirements for Sales analytics and designed a star schema. The schema design was crucial because all other layers, such as ETL and semantic model, depend on it.
  2. Endpoints – Prologika worked with the ERP subject matter expert to implement endpoints that will provide a data feed for each table in the schema.
  3. ETL– Prologika used Azure Data Factory (ADF) no-code pipelines. The integration effort was dramatically reduced because all ERP endpoints would adhere to the same specification and hide the underlying details for each specific ERP, such as which tables would need to be joined. For best performance and fast knowledge transfer, Prologika followed the ELT pattern to transform and load the staged data using T-SQL stored procedures.
  4. Data warehouse – Prologika staged, transformed, and loaded the data into a centralized data warehouse repository whose structure was based on the star schema design in step 1.
  5. Semantic model – Prologika implemented a Power BI semantic model that delivered instantaneous insights and single-version of truth irrespective of the reporting tool used (Power BI, Excel, or third-party tools). The semantic model included the desired business metrics, such as time intelligence, variances, as well as data security.
  6. Reports – Power BI report “packs”, such as Sales Analytics, Order Analytics, were developed that connected to the semantic model. Users also could use Excel and other tools to create pivot reports without extracting and duplicating data.

Benefits

The solution delivered the following benefits to the client:

  • Fast data integration – Each subsequent ETL iteration could benefit from the common architecture.
  • Data consolidation – Data was consolidated into a single repository.
  • Low licensing cost – The solution had a licensing cost in the range of $400/month, with the most spent in Power BI licensing.
  • Easy maintenance – The client was able to quickly learn, maintain, and enhance the solution.


Teo Lachev
Prologika, LLC | Making Sense of Data
logo

Prologika Newsletter Spring 2023

There has been a lot of noise surrounding a data lakehouse nowadays, so I felt the urge to chime in. In fact, the famous guy in cube, Patrick LeBlanc, gave a great presentation on this subject to our Atlanta Power BI Group and you can find the recording here (I have to admit we could have done better job with the recording quality, but we are still learning in the post-COVID era).

What’s Data Lakehouse

According to Databricks which are credited with this term, a data lakehouse is “a new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling business intelligence (BI) and machine learning (ML) on all data.” It other words, it’s a hybrid between a relational data warehouse and a data lake. Sounds great, right? Visualizing this in Microsoft parlor, the last incarnation of the lakehouse architecture that I came across looks like this:

The Good

I’m sure that many large companies or companies with complex data integration needs could benefit from a similar architecture. As I said many times, staging data to a lake is a good thing when you must deal with files. For example, some cloud vendor that hasn’t matured enough to give direct access to your data, could decide to push files instead (I described a similar scenario in this blog). A “network share” on steroids, the data lake is the best place to store files. A good question here and the one I personally struggled with would be “what if the data comes from relational databases or from REST APIs?” Should you stage that data in a data lake as files before it flows into the data warehouse? A wise consultant’s answer here would be “it depends”. Here are some good reasons when this might make sense.

  1. Stage data first – For some, a large ISV company (see related newsletter here), had to integrate data from many databases with similar but not the same schema. They preferred to stage the data to a data lake and figure out the integration “mess” caused my schema discrepancies and data quality later.
  2. A glorified archive – For example, in case you want to reload the data, you can do it from the lake in the case where the source systems truncate data. However, my personal preference to address this scenario would be to stage the data into a relational Operational Data Store (ODS), especially in the case where changes must be tracked. In a nutshell, if I’m given a choice between a file or relational database, I’d go with the latter.
  3. Synapse – If you decide to host your data warehouse in a Synapse dedicated SQL pool and use Azure Data Factory (ADF) to load the data, ADF will stage the data to Azure Data Lake Service (ADLS) anyway to load it faster into Synapse. Another good thing for Synapse here is that you can use Synapse Serverless to query that data using SQL which might come handy (I share some “serverless” lessons learned here).
  4. Data science – There are some good reasons why data scientists prefer files instead of loading the data from a relational database. Or so I was told (I’m not a data scientist).
  5. Uniformity – If your organization prefers a uniform data flow path despite the additional effort, inconvenience, and redundancy, then this might make sense. Then despite the source data type (structured or unstructured), all data follows the same ingestion pipeline. Just make sure to hire more ETL developers.

Outside these considerations, when you can connect directly to the data source, staging data to files is probably overkill as files are notoriously difficult to deal with.

The Bad

Now let’s look at the so-called zones in the lake: raw, enriched and curated, sometimes also referenced as bronze, silver, and gold. The idea here is to enrich the staged data. So, the raw zone has the staged data 1:1 as in the source. Then let’s say a data scientist needs some enrichment, and we spin more ETL to add a bunch of columns to some file. And then Business needs to reference the data that might require more enrichment. So, into the ETL rabbit hole we go again.

The problem is that many people take this architecture verbatum, whether it makes sense or not. A question came from the audience during Patrick’s presentation “What data do we add to these zones?” How do we know when it’s time to move to the next zone? And the answer here is that these zones are just a recommendation that someone has come up with. A large organization might benefit from them. But in most cases in my opinion spinning more and more ETL and moving data around just so that you follow some vendor’s best practices, makes no sense. And should you stage the data 1:1 from the source? In some cases, like the Get Data First aforementioned scenario, it might make sense. But in most cases, it would be much more efficient to stage the data in the shape you need it, which may necessitate joining multiple tables at the source (by the way, a relational server is the best place to handle joins).

The omni-presence of Synapse in such architectural diagrams is questionable at least. As I stated in another newsletter, like a red giant star, Synapse seems to engulf everything in its path in order to increase its value potential. But Synapse shouldn’t be a default choice for most organizations. It’s rather expensive and has limitations, such as lacking important T-SQL features.

Finally, Spark/Databricks that orchestrates the data preparation with Python or some other custom code since all the toolset you get is a notebook with a blinking cursor. What happened to low code, no code approach? More ETL developers to the rescue…

The Ugly

The omnipresence of the delta lake regardless if it makes sense or not. I’m sure that some scenarios for staging changing data into a lake, such as IoT streaming, will benefit greatly from a delta lake. But it shouldn’t be a default recommendation. The moment we introduce a delta lake, our tool choice becomes rather restricted because of the file format. On ETL side of things, for example, you must use data flows with Azure Data Factory (I’d personally favor ELT over data flows). And to read the data, you must provision either a Spark cluster or Synapse Serverless. So, complexity increases together with cost while data accessibility decreases.

UPDATE 05/28/2023 Microsoft Fabric embraced the delta format as the its native storage but provides more options, including Power Query dataflows and Azure Data Factory copy activity, to load data. All Fabric services save and read data from the delta lake and you don’t have to provision anything.

And if you go with Databricks (credited for inventing the delta lake too), they are far more ambitious . They want to replace RDBMs for OLAP (OLTP won’t work with a delta lake for performance reasons). We’ve seen similar claims before and how they ended. Another question came from the audience during the presentation was if a lakehouse can deliver the same performance as a relational database. One house must be redundant, right? True, after rewriting their software, Databricks can deliver some decent performance (they even claim to be the world’s fastest “data warehouse” although only one other vendor submitted results to that specific benchmark). James Serra (Data & AI Solution Architect at Microsoft), whose excellent blog discusses these topics in detail, recently gave our group a presentation and said that anyone he knows of that has tried replacing a relational data warehouse with a data lake, has failed. Enough said.

What’s a best practice? A best practice to me is adopting the most efficient way to achieve something without sacrificing too much flexibility for what might be thrown at you in the future. To me, a lakehouse as a replacement for a relational data warehouse or as a default staging area is as big of a hype as Big Data was, with all the vendor propaganda surrounding it to buy stuff you don’t need. Large organizations with complex integration needs might benefit from the lakehouse architecture shown above. However, most companies could save a lot of implementation, maintenance, and licensing costs by simplifying it and judicially introducing pieces when it makes sense.


Teo Lachev
Prologika, LLC | Making Sense of Data
logo

Prologika Newsletter Winter 2022

A few days ago an exalted customer shared that they’ve acquired Synapse and now they’re ready for Power BI modeling. He just wasn’t sure how to give business users access to Synapse so cool self-service BI can finally start. In the process of the conversation, it became clear that he opened Synapse Studio and was left with the impression that Synapse has semantic modeling features. This is what happens when Marketing gets involved and people get confused about what a tool actually does. This newsletter attempts to clear up this confusion.

What’s Synapse?

Think of Synapse (aka Azure Synapse Analytics) as a marketing name that spans multiple unrelated (or rather loosely) related services that are sold separately but are bundled together to fulfill a vision of a “unified analytical platform”. This vision is further emphasized by Synapse Studio – an online tool to work with and monitor the Synapse services.

Like a red giant star, Synapse seems to engulf everything in its path.

Let’s explain each service in the order it’s listed in the Azure pricing calculator. Again, each service has its own pricing model, and the “bundle” doesn’t give you any price break.

  • Data Integration – This is Azure Data Factory, which is typically acquired and installed as a standalone service. Why would you want to create ADF pipelines inside Synapse Studio instead of ADF Studio is beyond me. Another caveat to watch for regarding data integration is that Microsoft seemingly emphasizes the role of ADF data flows (at least there is a separate “Data flows” section in Synapse Studio) despite that the ELT pattern is a best practice to load data into the SQL dedicated pool.
  • Data Warehousing – Synapse comes with a preconfigured “serverless” pool that can be used to virtualize data stored in Azure Data Lake. This is a very useful service that allows you to query data in ADLS files using T-SQL. Check this case study to learn how Prologika used this feature in a real-life project. This tab also provides pricing for a dedicated SQL pool but since there is a separate tab for it, I’ll cover it further down.
  • Big Data Analytics – You can optionally provision an Azure Spark pool to process data or apply ML at scale using the Microsoft implementation of Apache Spark.
  • Log and Telemetry Analysis – A recently introduced type of pool for analyzing large volumes of data streaming (i.e. log and telemetry data) from applications, websites, or IoT devices using Kusto Query Language (KDL).
  • Dedicated SQL Pool – This is your SQL Server (or rather Azure SQL Database) on steroids for storing and querying massive data volumes that was previously known as Azure SQL DW. While you gain scalability, you lose various T-SQL features so don’t think that you can seamlessly migrate your on-prem SQL databases to Synapse. Also, for now, a dedicated pool is limited only to a single database.
  • Azure Synapse Link – Another recently introduced service to automatically synchronize data from Azure Cosmos DB and SQL Server 2022 (without using change data capture).

What Synapse is not?

Despite that Synapse Studio might trick you otherwise, Synapse is not:

  • Synapse is not a semantic modeling tool. Although you’ll see a Power BI section in the Develop tab of Synapse Studio, modeling is still done with Power BI Desktop (or other professional tools) and published to Power BI. As with ADF, why would a developer want to register your Power BI artifacts in Synapse Studio is another thing that escapes me.
  • Synapse is not a data integration tool, master data management tool, or data cataloging tool.
  • Synapse shouldn’t be your default option for data warehousing in the cloud. In my experience, Synapse would be an overkill for data processing needs of most companies because there are more cost-effective options for SQL Server in the cloud with less data.


Teo Lachev
Prologika, LLC | Making Sense of Data
logo

Prologika Newsletter Fall 2022

BI implementation shortcuts are tempting and disguise themselves as cost-effective. True, you can slap a Power BI dataset on top of your data (the cornerstone of self-service BI), but most BI implementations will greatly benefit from a datamart, irrespective of the propaganda of self-service BI vendors. I explain the benefits of datamarts vs Power BI datasets directly on data sources in my blog “Datasets vs. Datamarts”. This newsletter discusses a newly released Power BI feature that attempts to simplify the implementation of datamarts by business users.

 

Understanding Datamarts

Let’s define a datamart as a centralized repository for hosting clean and trusted data that is typically organized in a star schema, i.e. as a set of fact tables and dimension (lookup) tables. BI pros would typically host a datamart in a relational database. Nowadays, a cloud-based datamart, such as a datamart hosted in Azure SQL Database, Azure SQL Managed Instance, or Azure Synapse, is the norm. Once the datamart schema is designed, a BI pro would use a professional tool, such as Azure Data Factory (ADF) to load the datamart periodically, such as once every night.

As Microsoft announced here, Power BI datamarts are mean to democratize datamart implementations by business users. In my opinion, this is yet another premium feature, spearheaded with great vision and effort, but questionable practical value. In a nutshell, a Power BI datamart is a combo of Power BI Premium and a Microsoft-hosted Azure SQL Database aiming to simplify the implementation of a departmental datamart.

The Good

Unlike other vendors, such as Domo and their proprietary and overly expensive stack, Microsoft has decided to go with somewhat open solution consisting of tools that Power BI users already know: Power Query, Power BI Desktop (for the first time some of its modeling features, such as relationships and DAX measures, made it to the cloud), and SQL Server. Microsoft provisions the database for you although surrounds it with some red tape (more on this in a moment). Thus, a business users aiming for “no code, low code” experience will whip out Power Query dataflows that populate the database and then build a model (dataset) directly in Power BI Service. Obviously, the main goal is to simplify the experience as much as possible where all the action happens online.

It’s nice that Microsoft chose hosting the data in a SQL database instead of a “lakehouse”. Apparently, they learned some painful lessons from Power Query CDM folders. The database size is up to 100 GB which is not bad at all. I also like that some Power BI modeling features finally made it to Power BI Service, but unfortunately only in the confines of the datamart feature.

The screenshot below shows how a business user can use Power Query dataflow to transform and save the data into a Power BI datamart.

The Bad

From the announcement, “Best of all, IT doesn’t have to worry about getting all data into centrally governed data sources, thus providing discipline at the core and flexibility at the edge.” I failed to see how this will provide “discipline at the core” – a tenant that Microsoft learned from their own pain points after tilting too much toward self-service BI. I’ve also seen statements online that business users don’t have to “consult with IT anymore” when implementing datamarts. Really? What happened to managed self-service BI? I’m sure IT will be thrilled having corporate data in Microsoft-owned databases that they can’t manage and queries running amuck and consuming precious premium resources. Luckily, the admin portal has a switch to control who can create these datamarts. I hope at least we have a BYO (bring your own) database feature at some point.

The elastic Azure SQL database that Microsoft provisions is read-only, meaning that you can’t extend it. I’m a big fan of pushing calculations as much upstream as possible to SQL Server, such as by implementing SQL views, but we can’t do that. Instead, we would use Power Query (what else of course) for all data transforms. But I have serious reservations against Power Query dataflows – a tool that is known to cause performance issues without providing any troubleshooting and maintenance insights.

The Ugly

Do we really need this feature? I would argue that what was really needed was extending Power Query with “destinations” where the user can specify where the data would land. If that was implemented, IT could selectively let business users augment the infrastructure set up by IT with self-service ETL (more than likely temporary) that sinks the data into an IT-sanctioned database.

Further, it would have gotten us out of another proprietary mess that forces dataflows to save their output into CDM folders that make sense only to Microsoft (see my “Power BI Dataflows vs ADF Mapping Data Flows” blog for the gory details). Want to save dataflow data somewhere else? For now, you must use Power BI datamarts because this is the only way you can have your data in a (Microsoft-provided) relational database and nowhere else.

Conclusion

Recently, an enterprise client decided to migrate all self-service Alteryx flows to IT-governed ADF pipelines. More than likely, Power BI datamarts will be heading in that direction. Be very careful about any pure self-service features, as you might find yourself in a bigger mess that you tried to solve.


Teo Lachev
Prologika, LLC | Making Sense of Data
logo

Prologika Newsletter Summer 2022

The workhorse of any modern BI solution is the semantic model that provides unparallel performance and contains business logic and security roles. Microsoft BI gives us three options to host semantic model: SQL Server Analysis Services (SSAS), Azure Analysis Services (AAS), and Power BI.  This newsletter explains why it might be beneficial to consider AAS and includes a script for automating resuming, processing, and synchronizing a model hosted on AAS.

Why Azure Analysis Services?

Microsoft BI practitioners have three options for hosting semantic models: SSAS (on prem), Azure Analysis Services (cloud), and Power BI (cloud). AAS is somewhat caught between a rock and a hard place. Given that Power BI gets the most attention for cloud deployment, why would you consider AAS at all? There are two main reasons:

  1. Cost – Organizational semantic models might require a lot of memory and crunching power. Hosting them on AAS might be more cost effective. For example, AAS S4 runs at around $5,000 which at the same price point as Power BI Premium P1. However, it gives you 100 GB of RAM and 20 cores, whereas P1 has only 25 GB and 8 cores.
  2. Scaling out – A feature unique to AAS is ability to scale out to multiple query replicas. This is not an option with Power BI Premium, and it requires quite a bit of setup with SSAS. However, AAS makes scaling out easy by just changing a slider. And once you’re done, you can pause it, so it doesn’t incur cost!

Automating the solution

Scaling out proved to be a useful feature lately when a client wanted to process massive queries in parallel. We cloned the model to AAS and wrote an ETL job to parallelize the query execution.

Note that the number of replicas depends on the data region and pricing level. For example, only East US 2 and West US support up to 7 query replicas up to S4. Another thing to watch for is that it’s not enough to just process the model on a scale-out farm. You’d need also to synchronize it across the query replicas. This could be done manually in the Azure Portal or automated, such by using the PowerShell script below that you can plug in a SQL Agent job. The script uses a regular AAD account which has admin rights to the server. You can also use a service principal, but I opted for a regular account because Microsoft removed the option for no expiration date for the client secret (the maximum lifetime of a client secret now is two years).

Import-Module Az.AnalysisServices
$password = "<account password>" | ConvertTo-SecureString -asPlainText -Force
$username = "craas@<domain>.com"
$aasendpoint = "asazure://aspaaseastus2.asazure.windows.net/crliveaas1"
$aasendpointmgmt = "asazure://aspaaseastus2.asazure.windows.net/crliveaas1:rw"
$TenantId = "<tenant id>"
$credential = New-Object System.Management.Automation.PSCredential($username,$password)
$defaultProfile = Connect-AzAccount -Credential $credential -Tenant $TenantId

Set-AzContext -Tenant $TenantId -DefaultProfile $defaultProfile
$server = Get-AzAnalysisServicesServer -ResourceGroupName "crliveaas_rg" -Name "crliveaas1" -DefaultProfile $defaultProfile
if ($server.State -eq "Paused")
{
    Resume-AzAnalysisServicesServer -Name "crliveaas1" -ResourceGroupName "crliveaas_rg"  
    #process database; ClearValues removes the data to reduce the memory footprint
    Invoke-ProcessASDatabase -Server $aasendpointmgmt -DatabaseName "<databasename>" -RefreshType "ClearValues" -Credential $credential
    Invoke-ProcessASDatabase -Server $aasendpointmgmt -DatabaseName "<databasename>" -RefreshType "Full" -Credential $credential

    # sync database
    Add-AzAnalysisServicesAccount  -Credential:$credential -RolloutEnvironment:"aspaaseastus2.asazure.windows.net"
    Sync-AzAnalysisServicesInstance -Instance $aasendpointmgmt -Database "<databasename>" -PassThru
}

Conclusion

When it comes to cloud deployment of Analysis Services semantic models, Power BI is preferable because you’re always on the latest features. However, Azure Analysis Services can help you reduce cost and scale out query execution – feature that Power BI doesn’t support.


Teo Lachev
Prologika, LLC | Making Sense of Data
Microsoft Partner | Gold Data Analytics

logo

Prologika Newsletter Spring 2022

Nowadays, it’s unlikely to envision a data analytics solution without ingesting data from some cloud vendor. Unfortunately, as many of you have found out the hard way, moving to the cloud, such as moving your on-prem ERP system to the vendor’s cloud offering, comes at a huge burden – you relinquish control to the vendor and lose access to your data. In this letter, I’ll present several options for extracting your data out of the vendor’s realm.

I previously ranted here about cloud prohibitors – cloud providers that negate the benefits of moving to the cloud by enforcing all sorts of hoops to get to your data. This blog summarizes a few data integration options with cloud vendors that I’ve used in my projects, ranked from best to worst

Direct access to the database

This is by far your best option. If the provider allows you to access the data by directly connecting to its native storage, you should take it even if it involves higher fees, such as upgrading to a “premium” tier (a travesty considering that you didn’t have pay higher fees with their on-prem installations). There are no insurmountable engineering barriers for the vendor to provide direct access to the data in its native storage, but very few cloud vendors do it citing “issues”, such as security, impact on operational processes, etc. Here is a free piece of advice to cloud vendors: provide direct access to the underlying relational database and use this as a big differentiator against the competition. Security and performance should be on you and not the customer.

Data staging

If the direct access is not an option, the second best would be for the vendor to stage the data out, ideally to a relational database you provision, such as by using the Dynamics Data Export add-in to export data from Dynamics Online (now regretfully deprecated by Microsoft in favor of staging to data lake). Unfortunately, it looks like the norm nowadays is to export to a data lake as flat files and the extraction path becomes relational database -> flat files -> relational database. I hope that makes sense to you because it doesn’t to me. Things to watch out here for are where is the data lake located (Azure, AWS, others) and what integration options are provided. For example, only major ERP vendor supports only S3 as a data lake and provides and a JDBC driver to connect it (JDBC is a Java-based connector that can’t be used by any Microsoft-based integration tool.

Data push as flat files

Smaller and more flexible cloud providers might be willing to push the data to a storage that you provide. In such cases, you should strongly consider Azure data lake instead of FTP. For example, one cloud provider I integrated with used Ruby to call the Azure API to post data extracts to the client’s data lake storage that we set up.

REST APIs

Most cloud integration scenarios will fall in this bucket so that the vendor ensures “secure” and “controlled” access to data. The reality is that many of these REST APIs are horrible in both implementation and throughput. For example, the REST APIs of another ERP vendor couldn’t handle a batch export of 500,000 rows (a dataset that can fit into an Excel spreadsheet and be emailed around mind you). Their server would time out because of the “massive” data, and the client was asked to use callback APIs to chunk the export which of course wouldn’t either. And complexity only grows from here as some implementations require result paging, error handling, etc. forcing you to write custom code.

Manual export

If the above options are not possible, your last resort might be to use the vendor app and export the data manually. So much about automating the data integration…

Conclusion

When evaluating a cloud vendor, don’t forget to ask them how you can access your data. Extracting data by connecting to its native storage (typically a relational database) is the best option for integrating with cloud data sources, as the data is in its native format, you can fold operations, such as filtering and sorting, you can extract data incrementally, or even build SQL views if the vendor allows it. Unfortunately, the norm nowadays is to force you to call the vendor’s (often horrible) API, so the extraction effort in your ETL is on you and not the vendor.


Teo Lachev
Prologika, LLC | Making Sense of Data
Microsoft Partner | Gold Data Analytics

logo

Prologika Newsletter Winter 2021

Happy Holidays! More and more organizations consider data virtualization to abstract the underlying storage and integrate siloed sources. In this letter, I’ll discuss a real-life project that used PolyBase to expose third-party ERP data as SQL tables. Before I get to the subject of this newsletter, I’m excited to announce the seventh edition of my “Applied Microsoft Power BI” book. It should be available on Amazon in the first days of 2022. As far as I know, it’s the only book that is updated annually to keep it up to the date with the fast-changing Power BI. Stay tuned for a future blog with more details about the book.

 

Business Case

Think of data virtualization is a logical data layer that integrates enterprise data across various on-premises and cloud sources. A large, multinational chemical manufacturer decided to migrate their on-premises ERP system to the cloud. As usually happens, the tradeoff for embracing the cloud is losing access to your data in its native storage. Previously, the client could readily integrate the ERP data stored in a SQL Server database. But the ERP vendor didn’t support this option in their cloud offering. The usual explanation cites security and performance issues, although none of them really hold water. The ERP vendor could have supported a premium tier where data is exposed privately without affecting other customers, and report queries could have been redirected to a secondary replica. This is no different that securing and scaling an Azure SQL Database. Alas, as more and more companies find when embracing the cloud, the integration burden gets heavier and is on them and not on the vendor.

To make things even more difficult, the vendor had a replication mechanism to export the data to AWS S3 data lake. Realizing that most clients would struggle calling their REST APIs, the vendor provided an JDBC driver that abstracted the APIs. Great, except that the client wanted to access the data on Microsoft Azure, but no Microsoft tool supports JDBC drivers because no Microsoft BI tool is written in Java.

Integration Options

One integration option could have been to use a JDBC-capable ETL tool, such as Pentaho. But that would have required implementing integration pipelines to pull the data periodically and stage it on Azure. This presented two issues. First, the data integration effort became more difficult as someone had to own and troubleshoot ETL failures. Second, Business wanted as much real-time access to data as possible. In the past, the business users had implemented self-service Power BI models that they would refresh as needed to cache the data from the on-premises database. However, since the ERP vendor required at least 20 minutes for data changes to be applied to the S3 data lake and ETL needed additional time (even with incremental extraction), the data latency became an issue.

The second option was to somehow virtualize the data. Had the vendor supported exporting the data as files on Azure, Synapse Serverless could have been used to expose the data as virtual tables that can be queried with SQL and loaded in Power BI Desktop. But Serverless doesn’t support AWS S3 and even if it did, the vendor didn’t allow direct access to the staged data (REST APIs and JDBC driver were the only supported options).

PolyBase to the Rescue

The solution I proposed was to use PolyBase which is included in SQL Server and Azure SQL Managed Instance. Ideally, the client wanted a full PaaS solution but only PolyBase in SQL Server supports ODBC. So, we had to use an IaaS VM just to virtualize the data. This diagram shows the solution architecture.

An Azure VM was provisioned with SQL Server 2019 Standard Edition. The JDBC driver was installed on the VM and configured to access the ERP data. We used a third-party JDBC-to-ODBC bridge driver to map the JDBC data source as an ODBC data source. Then, PolyBase external tables were set up to virtualize the ERP data as SQL Server tables.

The main drawback of this solution was that no matter how small the source table was, PolyBase would add about 30 seconds in internal processing. Specifically, the PolyBase runtime log has a detailed trail that shows that it takes some time for PolyBase to “warm up” before it gets to the query, and then it needs even more time to process the results. That’s because, as a distributed system (like Synapse), a head node coordinates the query execution with data nodes even if everything is installed on a single VM. More gotchas specific to the fact that the vendor has decided to use Oracle as their relational database can be found at https://prologika.com/polybase-adventures/.

Conclusion

Microsoft has made bold strides in data virtualization. I’m really impressed by Synapse Serverless, which I used for other projects, such as for the project described in this case study. I wish Microsoft extends Serverless to support more storage options. If Synapse Serverless is not an option, your next best bet would be PolyBase. Although PolyBase is supported in SQL MI and Synapse, only the SQL Server box SKU supports ODBC data sources, requiring an IaaS layer to virtualize the data.

Benefits

The solution delivered the following benefits to the client:

  • No ETL effort – Data was left at the original source.
  • Data virtualization – Polybase was used to create external tables that can be queried just like SQL Server regular tables.
  • Reduced data latency – Data changes were available as soon as they are replicated to the data lake.
  • Scalability – PolyBase can be scaled out to other servers if needed.


Teo Lachev
Prologika, LLC | Making Sense of Data
Microsoft Partner | Gold Data Analytics

logo

Prologika Newsletter Fall 2021

LabCorp operates one of the largest clinical laboratory networks in the world. It also has an Interactive Response Technology system that healthcare vendors can use to conduct case studies.  Thanks to the cloud data analytics solution implemented by Prologika, LabCorp and its vendors can now analyze data across case studies. Read this newsletter to learn more about the solution architecture and business value.

 

Business Needs

The data for each study was saved into a separate on-prem SQL database. The total number of databases was more than 1,000. After the initial assessment, Prologika realized that one of the main gaps was that vendors couldn’t report across their studies or gain performance insights from studies conducted by other vendors. Further, as the IRT system evolves over time and to accommodate special requests, there were scheme differences between different versions.

LabCorp underscored the importance of consolidating the data from multiple studies into a single repository. They envisioned a cloud-based PaaS BI solution that would extract data from all the on-prem databases without impacting the system performance and centralize it into an enterprise data warehouse. Vendors would log an external portal that will deliver embedded reports. The first iteration was focused on analyzing audit and log data to gain strategic insights, such as how many users are using the system.

Solution

After assessing the current state and objectives, Prologika recommended and implemented the following architecture:

Prologika implemented an Azure Data Factory (ADF) configurable framework to extract data from the on-prem databases hosted on two production SQL Servers in parallel. The framework would stage the data into partitioned parquet files in Azure Data Lake Storage (ADLS).

Then Prologika created views in Synapse Serverless to consolidate the file extracts. For example, an Audit table could exist in several on-prem databases on both production servers. While each table would be staged in a separate file (or multiple files if the table supports incremental extraction), a Synapse Serverless would present a consolidated view across all files. We were impressed by the capabilities and performance of Synapse Serverless.

Another ADF process would extract the data from the Synapse Serverless views and load the data into a data warehouse hosted in a Synapse SQL pool. Although we considered other methods, such as using a linked server in a Azure SQL Managed Instance to Synapse Serverless to avoid data staging, we settled on Synapse SQL mainly for its scalability. Finally, a Power BI semantic model was created and Power BI Embedded used to deliver reports.

Benefits

The solution delivered the following benefits to LabCorp:

  • Data consolidation – Data was extracted with minimum impact to the operational systems and consolidated in ADLS.
  • Data virtualization – Thanks to the Synapse Serverless flexible support of schema differences, virtual views were created.
  • Scalability – Synapse SQL pool can scale almost indefinitely.
  • Secure and fast insights – Each vendor can access only their data.


Teo Lachev
Prologika, LLC | Making Sense of Data
Microsoft Partner | Gold Data Analytics

logo

Prologika Newsletter Summer 2021

Business Performance Management (BPM) is a methodology to help the company predict its performance. An integral part of a BPM strategy is creating and monitoring a scorecard with Key Performance Indicators (KPIs). In this newsletter, I’ll discuss how the newly released Power BI Goals can help you augmenet your BPM strategy. But before that, I’d like to share my excitement that Microsoft have recently awarded me FastTrack Recognized Solution Architect – Power BI! I’m one of the 33 individuals worldwide who got nominated by the Power Platform product engineering team for consistently exhibiting deep architecture expertise and creating high quality solutions for customers during project engagements.

Introducing Power BI Goals

A vital BI practice for every organization, performance management ensures that important metrics, such as Key Performance Indicators (KPIs), meet established goals. The typical artifact to do so is implementing a scorecard: a report that compares the current state with the desired state of these metrics. You might have also heard the term “balanced scorecard” which is an organization-wide scorecard that tracks several subject areas, such as Finance, Customer, and Operations. In the past, organizations would use different tools, such as the now deprecated PerformancePoint (included in SharePoint Server) to implement balanced scorecards. Realizing the importance of scorecards, Power BI introduced Goals that aim to simplify the process of implementing departmental and organizational scorecards. For more information on how Goals works, watch the “Goals in Power BI” presentation from the Microsoft Business Application Summit

What’s to like
As with anything Power BI, Microsoft has democratized scorecards so business users with no reporting experience can quickly assemble them from existing reports. Think of a goal as a line (or KPIs) in the scorecard. Here is scorecard with two goals:

Currently, Power BI supports two goal types:

  • Static – The goal creator manually enters and track the goal properties, such as current value, target value, and status. This could be useful for quick and dirty KPIs that are not backed by a data source, such as launching a new promotion campaign. In the scorecard above, I created the Revenue goal by entering 5M as the current value and 5.5M as the goal.
  • Data-driven – The goal current value and/or target value can be data-driven and bound to metrics from existing report(s). Coming from Analysis Services, I was initially surprised that Power BI doesn’t require implementing KPI measures, but I get it: Microsoft decided to source the metrics from reports so business users can easily apply filters. If the goal owner chooses a metric from a visual that has a Date field, such as a time series chart, Power BI automatically shows a sparkline for the goal progress over time. And, of course, when the report dataset is refreshed, the goal values are updated.

So, no modeling or Power BI Desktop required assuming that someone else, such as a data analyst, has delivered functional reports with the metrics. Even better, the goal current and target values can come from different reports (even a report in a different workspace if you have permissions), e.g. a report with actuals and another report with targets. So, there is plenty of flexibility here. To mimic a balanced scorecard that spans multiple subject areas, the owner can create subgoals. For example, the main goal could be Finance with subgoals Revenue, Margin, etc. Because like dashboards, goals are “pinned” from reports, the end user can navigate to the underlying report to examine the data in more detail. Users can also add notes to explain the goal behavior to the teammates.

A scorecard is a first-class Power BI citizen, and as such, it can be secured, endorsed, secured with sensitivity labels, annotated, and shared, such as sharing the scorecard to a Microsoft Teams channel. The scorecard data is saved in a Power BI dataset that users can connect to build custom reports. Moreover, Power BI automatically adds daily snapshots to the dataset allowing users to build up a history of the goals. For example, if the underlying report is refreshed daily, the updated goal values will be appended to the dataset. Developers can use the Power BI REST APIs to implement programmatic scorecard management solutions.

What’s not to like

Besides navigating to the underlying report, a goal is a one-liner in the scorecard. I can’t define a goal that shows me a metric sliced by dimension members, such as business unit. Further, subgoals are not currently aggregable, such as to sum or average values when rolling up to the main goal. Like limitations with dashboards, there is no way to apply a global filter to the scorecard, e.g. to filter all goals for the prior month.

Besides current and target values, no other goal properties can be data driven. For example, unlike Analysis Services KPIs, the goal status can’t be currently bound to a DAX measure. Changing the status requires proactive manual “check ins” although Microsoft mentions a forthcoming feature that will let users define rules to change the status, like how you can define rules for dashboard tile alerts. Speaking of data-driven properties, I don’t understand why you must use a date field to get the progress as opposed to any other field, such as Month, in your Date table.

Finally and unfortunately, Goals require Power BI Premium. If we really want to democratize features, shouldn’t we make them available in Pro?

Conclusion

Goals are a Power BI Premium feature aimed at making it easier to create scorecards and monitoring metrics from existing reports. They promote a “bottom-up” culture, where business users can create departmental scorecards without reliance on IT. Microsoft plans more features by the end of the year to make Goals more appealing, such as integration with Power Automate to trigger actions, rolling up subgoals, changing the goal tracking cycle (DoD, MoM, YoY), custom goal formatting, Power BI Mobile experience optimized for phones, providing a scorecard visual, and cascaded goals (hierarchy of goals).

If you find Power BI Goals somewhat inflexible or you don’t have budget to upgrade to Premium, you don’t have to use the Goals feature to implement scorecards. You can define KPIs and create dashboard-looking reports where you have complete control over the scorecard presentation.


Teo Lachev
Prologika, LLC | Making Sense of Data
Microsoft Partner | Gold Data Analytics

logo

Prologika Newsletter Spring 2021


Amidst the COVID pandemic, the Houston Health Department (HHD) had another predicament to tackle. With lab results accumulating rapidly at one million cases per month, the vendor system they used for capturing and analyzing COVID data couldn’t keep up. In this newsletter, you’ll learn how Prologika implemented a BI solution powered by SQL Server and Power BI to solve technology challenges, and deliver fast and reliable insights.

Business Challenges

The vendor SQL Server database had large tables with normalized name-value pairs for each question and answer received from the patient, and for each investigation result. To facilitate reporting, the vendor system had scheduled ETL processes to extract data from the OLTP tables into denormalized tables. However, locking conflicts and large data volumes would crash the system and fail the ETL processes.

As a result, business analysts were not able to get timely data for reporting. HHD and the vendor tried to isolate the reporting loads by replicating the data to a reporting database but the issue with populating the denormalized tables remained

Solution

A good solution starts with a solid foundation. After assessing the current state and objectives, Prologika recommended and implemented the following architecture:

The stand-alone SQL database was replaced with an AlwaysOn availability group. Not only did this provide high availability, but it also isolated operational from reporting workloads.

In the client’s own words “we have compared the cluster server to the report server and cluster is vastly superior with regard to performance for regular queries. One simple run was 4x faster on cluster than the current report server. A much more complex run took four minutes on cluster and I stopped the same run on the report server after 87 minutes.”

Previously, data analysts would produce reports using different tools, ranging from SQL, Python, to Power BI reports. With scarce resources, HHD found it difficult to continue that path. Instead, Prologika implemented a semantic model that was hosted in Power BI.

Benefits

The new architecture and semantic model delivered the following benefits to HHD:

  • A single version of truth – Strategic data and business calculations were centralized on one place.
  • Fast insights – The Power BI reports crunching millions of rows were instantaneous.
  • Isolation – Report loads were isolated from the operational loads because the data was cached in the semantic model.
  • Standardization and centralization – Power BI became the reporting platform for delivering insights across the organization.

 


Teo Lachev
Prologika, LLC | Making Sense of Data
Microsoft Partner | Gold Data Analytics

logo