Posts

App Integration Tools and Patterns

“…That’s when the teardrops start, fella…
…Dirty deeds and they’re done dirt cheap”

“Dirty deeds done dirt cheap”, AC/DC

A few months ago, I did an assessment for a large organization seeking to modernize their data analytics processes. As a part of standardizing the toolset, the client was considering Salesforce MuleSoft for both application and data integration. After all, the vendor told them the tool can do “everything”. Indeed, taken from the vendor page, we learn that “MuleSoft is a platform that gives IT the tools to automate everything. This includes integrating data and systems, automating workflows and processes, and creating incredible digital experiences — all on a single, easy-to-use platform.”

Let’s parse this claim.

Defining integration types

Quick as a bunny and excited that I finally have one tool that does it all, we hop to Gartner’s Magic Quadrant for Data Integration Tools. Surprise: Salesforce MuleSoft is not even there. Where is MuleSoft? It’s on Gartner’s Magic Quadrant for Integration Platform as a Service and the Magic Quadrant for API Management. Confused? Let’s start with definitions.

Data integration (note the focus on data)

“Gartner defines data integration as the discipline that includes architectural patterns, tools, and methodologies for accessing, harmonizing, transforming, processing, and moving data across various endpoints and infrastructures. It ensures consistent access and delivery of data across different subject areas and structures to meet business and application needs.”

Integration Platform (note the focus on applications)

“Gartner defines integration platform as a service (iPaaS) as a vendor-managed cloud service that enables end users to implement integrations between a variety of applications, services and data sources, both internal and external to their organization.”

API Management (note the focus on APIs)

“Gartner defines the application programming interface (API) management market as the market for software that supports API life cycle stages such as planning and design, implementation and testing, deployment and operation, and versioning and retirement.”

So, should we use MuleSoft for ETL and data integration patterns that typically involve large data volumes, such as loading a data warehouse? I say we don’t. You probably could, but it would be like trying to fit a square peg into a round hole. Did that client try it despite my advice against it? They did, and came to the same realization. Should we consider MuleSoft for enterprise-level app integration, such as when an event in one system triggers an action in another system by calling APIs? Probably; that’s what it’s designed to do anyway. Let’s choose the right tool for the job.

Application integration done dirt cheap

What about a true application integration scenario? In a current assessment, a client uses a third-party app that stores data in a SQL Server database. Each subsidiary has its own database that tracks invoicing locally through the app. However, a central NetSuite ERP system is used to consolidate invoices from all subsidiaries. So, the integration task at hand is to implement a bi-directional synchronization where each new local invoice (and its associated line items) is exported to NetSuite, and the local invoice is then updated with the corresponding NetSuite transaction ID (it’s more complicated than this, but you get the idea).

Now this is where we can surely use “specialized” tools like MuleSoft, Boomi, Celigo, etc., right? And we surely pay a premium, to the tune of thousands of dollars per month. I don’t claim to know these tools, but for the most part they appear to me to be nice UI wrappers on top of the vendor APIs. They certainly spare you from dealing with the actual API complexities, JSON payloads, and so on, but should you lock yourself in and break the bank for small-scale app integration? Nowadays, everyone wants to do more with less.

Therefore, knowing and using the Microsoft Azure stack, my personal choice is to first consider Azure Data Factory. True, it’s a generic data integration tool and I must learn how to call the vendor APIs and deal with JSON payloads, but it’s not a big deal (AI can help here). ADF can even do webhooks. And it will save you tons of money.
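To make this concrete, here is a minimal sketch of what the ADF side could look like: a Web activity that posts a new invoice to NetSuite’s REST record API. The account host name, the invoice fields, and the Authorization header are placeholders (NetSuite’s token-based authentication typically requires an OAuth signature generated outside ADF, for example in an Azure Function), so treat this as a starting point rather than a working integration:

{
   "name": "CreateNetSuiteInvoice",
   "type": "WebActivity",
   "typeProperties": {
       "url": "https://<account-id>.suitetalk.api.netsuite.com/services/rest/record/v1/invoice",
       "method": "POST",
       "headers": {
           "Content-Type": "application/json",
           "Authorization": "<OAuth signature or access token generated upstream>"
       },
       "body": {
           "entity": { "id": "<NetSuite customer internal id>" },
           "tranId": "<local invoice number>",
           "item": { "items": [ { "item": { "id": "<NetSuite item internal id>" }, "quantity": 1 } ] }
       }
   }
}

A follow-up Stored Procedure activity could then write the identifier of the newly created NetSuite transaction back to the local SQL Server invoice to complete the round trip.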


Top 5 Lessons Learned from IBM Db2 Integration Projects

I’ve done a few BI integration projects extracting data from ERPs running on IBM Db2. Most of the implementations used a hybrid architecture where the ERP runs on an on-prem mainframe while the data is loaded into Microsoft Azure. Here are a few tips if you’re facing this challenge:

  1. IBM provides JDBC, ODBC, and OLE DB drivers. The JDBC driver is not applicable to the Microsoft toolset because none of the Microsoft BI tools run on the Java runtime. Microsoft provides its own OLE DB Provider for Db2.
  2. If you use Azure Data Factory or Power BI Desktop, you can use the bundled Microsoft OLE DB driver for Db2. Here are the Azure Data Factory settings for setting up a linked service using the Db2 driver (a sample linked service definition follows this list):
    1. Server name – specify the server name or IP address of the IBM Db2 server. For Db2 LUW, the default port is 50000; for AS/400, the default port is 446, or 448 when TLS is enabled. Example: erp.servername.com:446 to connect to AS/400.
    2. Database name – that one is tricky as I’ve found that even the client doesn’t know it because the ODBC/JDBC driver hides it. The easiest way to obtain it is to connect Power BI Desktop to Db2 using the ODBC driver and look at the Navigator window.
    3. Package collection – this one can get you in trouble. If you get the error “The package corresponding to an SQL statement execution request was not found. SQLSTATE=51002 SQLCODE=-805”, enter NULLID, as explained in the ADF documentation.
  3. To use the IBM drivers, you need to install them on the on-prem VM hosting the ADF self-hosted runtime. Instead, consider using the Microsoft-provided drivers because they are bundled with the runtime and there is nothing for you to install.
  4. Based on my tests, all drivers have similar performance, but do upgrade to the latest IBM drivers if you plan to use them. In one project, I found that the latest IBM ODBC driver was twice as fast as an older driver the client had.
  5. The read throughput is going to be limited mostly by the mainframe server itself and by the connection bandwidth between the data center and Azure. In all projects, the mainframe read speed was far slower than reading data from a modest SQL Server. For example, one project showed that it took 2.5 minutes to read one million rows from the production mainframe, while it took 40 seconds to read the same data from SQL Server.

     

    While optimizing the mainframe performance will likely be outside your realm of possibilities, you could lobby to increase the connection throughput between the data center and Azure if you plan to transfer a lot of data. Aim for 500 Mbps or higher. For example, after the 2.5 minutes it took to read the aforementioned one-million-row dataset from the server, it took another 2.5 minutes to transfer the data (about 2.5 GB) to Azure because the connection throughput for this project was only 100 Mbps.
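For reference, here is roughly what the resulting linked service definition looks like in ADF JSON, using the settings from tip 2. The server, database, and credential values are placeholders for your environment, and you should verify the property names against the current ADF Db2 connector documentation:

{
   "name": "Db2LinkedService",
   "properties": {
       "type": "Db2",
       "typeProperties": {
           "server": "erp.servername.com:446",
           "database": "<database name discovered via the Power BI Navigator>",
           "authenticationType": "Basic",
           "username": "<user>",
           "password": { "type": "SecureString", "value": "<password>" },
           "packageCollection": "NULLID"
       },
       "connectVia": {
           "referenceName": "<your self-hosted integration runtime>",
           "type": "IntegrationRuntimeReference"
       }
   }
}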

Fixing ADF DataFactoryPropertyUpdateNotSupported Deployment Error

This took a while to figure out…so someone might find it useful.

Scenario: You have set up two Azure Data Factory instances for the DEV and PROD environments. The DEV environment uses a self-hosted integration runtime to read data from on-prem data sources. The day has come to deploy your masterpiece to PROD. You share the DEV self-hosted runtime with the production ADF instance. You export the ADF ARM template and attempt to run it against the pristine PROD environment. You’re greeted with the DataFactoryPropertyUpdateNotSupported error. The “property” here is the linked self-hosted runtime.

Solution: The following worked for me. In your ARM template, scroll all the way to the bottom and make sure the self-hosted runtime resource includes the linkedInfo section in its typeProperties element, as shown below:

IMPORTANT: Make sure that in your production ADF, you have created a linked self-hosted runtime that points to the shared runtime in your DEV environment before you proceed with the ARM template deployment.

{
   "type": "Microsoft.DataFactory/factories/integrationRuntimes",
   "apiVersion": "2018-06-01",
   "name": "[concat(parameters('factoryName'), '/mtx-adf-shir-us')]",
   "dependsOn": [],
   "properties": {
       "type": "SelfHosted",
       "typeProperties": {
           "linkedInfo": {
               "resourceId": "Resource ID obtained when you share the self-hosted runtime in DEV",
               "authorizationType": "Rbac"
           }
       }
   }
}

Solving Configuration Errors with ADF Self-hosted IR

You’ve set up the Azure Data Factory self-hosted integration runtime to access on-prem data sources. You create a linked service, click Test Connection, and get greeted with an error saying the security context can’t be passed. On the on-prem VM, you use the Integration Runtime Configuration Manager and get a similar error, or something to the effect that the JSON can’t be parsed. You spend a few hours trying everything that comes to mind, such as checking firewalls and testing connectivity from SSMS, but nothing helps.

How do we fix this horrible problem? We double the backslashes in the server name (if you use a named instance) and in the user name (after the domain) in the linked service properties. Apparently, Spark/Databricks has an issue with backslashes.
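For example, given a hypothetical named instance MYSERVER\SQL2019 and a hypothetical domain account MYDOMAIN\svc_adf, the linked service properties would be entered as:

Server name: MYSERVER\\SQL2019   (instead of MYSERVER\SQL2019)
User name:   MYDOMAIN\\svc_adf   (instead of MYDOMAIN\svc_adf)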

Presenting Analytics in a Day Workshop on August 20th

In partnership with Microsoft, I’m delivering a complimentary, one-day, Analytics in a Day virtual workshop on August 20th, 9 AM – 5 PM Eastern Time. Targeting BI developers, architects and technology decision makers interested in achieving a single version of truth with organizational BI, this workshop is designed to guide and accelerate your journey towards a modern data warehouse to power your business with Azure Synapse, Azure Data Factory, Azure Data Lake, and Power BI.

The first half of the day from 9 am – 1 pm will help you better understand how to:

  • Create an analytics solution that goes from data ingestion to insights using Azure Synapse Analytics and Power BI
  • Empower self-service analytics
  • Enable a truly data-driven culture in your business

Part of the workshop will be dedicated to hands-on training to help you get started on your cloud analytics journey.

The second half of the day from 1 pm – 5 pm is optional and will focus on best practices around designing Power BI + Azure Synapse Analytics BI solutions to enable “Discipline at the core and flexibility at the edge”.

Say goodbye to data silos. Analytics in a Day is designed to simplify and accelerate your journey towards a modern data warehouse to power your business. Reserve your seat today at http://bit.ly/aid202008.

ADF or SSIS, That Is the Question

Microsoft introduced Azure Data Factory (ADF) in 2015 to handle a specific scenario: tumbling window loads for Hadoop and other big data systems for internal Microsoft usage. Then a need arose for a public-facing PaaS data integration platform, which led to an interesting dilemma: SSIS or something else? Since SSIS wasn’t designed to scale out in the cloud, ADF rose to the challenge. I could argue that a better approach would have been migrating the SSIS design experience to the cloud to retain existing skills and make thousands of BI developers feel at home, as Microsoft did with the SQL Server Database Engine and SSAS, but it’s a moot point now. So, ADF was incorrectly positioned as “SSIS for the Cloud”, and unfortunately, once that message made it out, Microsoft has been fighting a messaging problem ever since. Like Azure ML, on the glory road to the cloud, things that were difficult with SSIS (installation, projects, deployment) became simple, and things that were simple became difficult. Naturally, Microsoft took a lot of criticism from customers and the community, including from your humble correspondent. ADF, of course, has nothing to do with SSIS, thus leaving many data integration practitioners with a difficult choice: should you take a risk and follow the road less traveled with ADF, or continue with the tried-and-true SSIS for data integration on Azure?

To Microsoft’s credit, ADF v2 has made significant enhancements in features, usability, and maintainability. There is also a “lift and shift” option to run SSIS inside ADF, but since this architecture requires a VM, I consider it a niche scenario, such as when you need to extend ADF with SSIS features that it doesn’t have. Otherwise, why would you start new development with SSIS hosted under ADF when you could provision and license the VM yourself and have full control over it?

SSIS has been evolving for 13 years (or 20 years if you count DTS), and it has a rich ecosystem of add-ons to tackle challenging data integration needs. ADF, on the other hand, has a humble three-year history. It would probably take Microsoft another decade to get ADF to where SSIS is today. Here are some limitations that you’ll probably run into:

  • ADF supports a limited set of connectors (even Excel is not on the list)
  • There is practically no extensibility except custom activities (which require .NET coding).
  • The ADF Execute Pipeline activity doesn’t support dynamic content for the pipeline name. This means that you can’t dynamically bind the activity to execute pipelines (an ADF pipeline fulfills the same role as a package in the SSIS world) from a list stored in a database table. Although this sounds trivial, it won’t let you implement a comprehensive ETL framework. An ETL framework is important from an architectural standpoint, as any self-respecting data integration developer would agree. For instance, our SSIS ETL framework complements the SSIS catalog with features that SSIS doesn’t have, such as declarative parallelism, restartability, actual-vs-target duration monitoring, logging, and others that I discussed in my “Is ETL (E)ating (T)hou (L)ive?” newsletter. Unfortunately, it’s not possible to migrate or implement such frameworks in ADF because of the above-mentioned Execute Pipeline restriction.
  • Continuing with limitations, ADF lags considerably behind SSIS in the areas of troubleshooting and monitoring. For example, if you misconfigure a previous step, you’ll get “Activity failed because an inner activity failed” during debugging with no way to troubleshoot it; this took hours on a recent project! Of course, ADF is a cloud technology, so I hope the momentum behind it continues and the feature gap narrows to bring ADF on a par with SSIS.

As of now, consider Azure Data Factory for pure PaaS solutions (sometimes this is mandated by upper management and you don’t have much latitude). ADF would also be a good choice when your data integration loads need to scale, such as when you run a lot of ETL jobs that might be difficult to scale on a VM running SSIS. If you decide on ADF, simplify your data integration pipeline as much as possible. Strongly consider the ELT pattern instead of ETL (see that newsletter again if you don’t know what that means). Although all Microsoft cloud architecture diagrams prominently show ADF, my personal preference for cloud data integration projects is still SSIS on a VM with the Azure Feature Pack.

I’d love to hear your feedback.