
Top 5 Lessons Learned from IBM Db2 Integration Projects

I’ve done a few BI integration projects extracting data from ERPs running on IBM Db2. Most of these implementations used a hybrid architecture where the ERP ran on an on-premises mainframe while the data was loaded into Microsoft Azure. Here are a few tips if you’re facing this challenge:

  1. IBM provides JDBC, ODBC, and OLE DB drivers. The JDBC driver is not applicable to the Microsoft toolset because none of the Microsoft BI tools run on the Java runtime. Microsoft also provides its own OLE DB Provider for Db2.
  2. If you use Azure Data Factory or Power BI Desktop, you can use the bundled Microsoft OLE DB driver for Db2. Here are the key Azure Data Factory settings when setting up a linked service with the Db2 connector (see the sample linked service definition after this list):
    1. Server name – specify the server name or IP address of the IBM Db2 server. For Db2 LUW the default port is 50000; for AS/400 the default port is 446, or 448 when TLS is enabled. Example: erp.servername.com:446 to connect to AS/400.
    2. Database name – this one is tricky, as I’ve found that even the client often doesn’t know it because the ODBC/JDBC driver hides it. The easiest way to obtain it is to connect Power BI Desktop to Db2 using the ODBC driver and look at the Navigator window.
    3. Package collection – this one can get you in trouble. If you get the error “The package corresponding to an SQL statement execution request was not found. SQLSTATE=51002 SQLCODE=-805”, enter NULLID, as explained in the ADF documentation.
  3. To use the IBM drivers, you need to install them on the on-prem VM hosting the ADF self-hosted runtime. Instead, consider using the Microsoft-provided drivers because they are bundled with the runtime and there is nothing for you to install.
  4. Based on my tests, all drivers have similar performance, but do upgrade to the latest IBM drivers if you plan to use them. In one project, I found that the latest IBM ODBC driver outperformed the older driver the client had installed by a factor of two.
  5. The read throughput is going to be limited mostly by the mainframe server itself and by the connection bandwidth between the data center and Azure. In all projects, the mainframe read speed was far slower than reading data from a modest SQL Server. For example, one project showed that it took 2.5 minutes to read one million rows from the production mainframe, while it took 40 seconds to read the same data from SQL Server.


     While optimizing the mainframe performance will likely be outside your realm of possibilities, you could lobby to increase the connection throughput between the data center and Azure if you plan to transfer a lot of data. Aim for 500 Mbps or higher. For example, after the 2.5 minutes it took to read the aforementioned one-million-row dataset from the server, it took another 2.5 minutes to transfer the data (about 2.5 GB) to Azure because the connection throughput for this project was only 100 Mbps.
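Putting these settings together, here is a minimal sketch of what the Db2 linked service JSON might look like. All names and placeholder values are hypothetical, and I’m assuming basic authentication, the Microsoft-bundled driver, and a self-hosted integration runtime; the exact property names have changed between ADF versions (newer versions favor a single connectionString property), so verify them against the current ADF Db2 connector documentation:

{
   "name": "Db2ErpLinkedService",
   "properties": {
       "type": "Db2",
       "typeProperties": {
           "server": "erp.servername.com:446",
           "database": "<database name you found through the Power BI Navigator>",
           "authenticationType": "Basic",
           "username": "<user name>",
           "password": {
               "type": "SecureString",
               "value": "<password>"
           },
           "packageCollection": "NULLID"
       },
       "connectVia": {
           "referenceName": "<your self-hosted integration runtime>",
           "type": "IntegrationRuntimeReference"
       }
   }
}

If you hit the SQLCODE -805 error mentioned above, setting packageCollection to NULLID is usually all it takes.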

Fixing ADF DataFactoryPropertyUpdateNotSupported Deployment Error

This took a while to figure out…so someone might find it useful.

Scenario: You have set up two Azure Data Factory instances for the DEV and PROD environments. The DEV environment uses a self-hosted integration runtime to read data from on-prem data sources. The day has come to deploy your masterpiece to PROD. You share the DEV self-hosted runtime with the production ADF instance. You export the ADF ARM template and attempt to run it against the pristine PROD environment. You’re greeted with a DataFactoryPropertyUpdateNotSupported error. The “property” here is the linked self-hosted runtime.

Solution: The following solution worked for me. In your ARM template, scroll all the way to the bottom and make sure the definition of the self-hosted runtime includes the linkedInfo section in its typeProperties element, as shown below:

IMPORTANT: Make sure that in your production ADF, you have created a linked self-hosted runtime that points to the shared runtime in your DEV environment before you proceed with the ARM template deployment.

{
   "type": "Microsoft.DataFactory/factories/integrationRuntimes",
   "apiVersion": "2018-06-01",
   "name": "[concat(parameters('factoryName'), '/mtx-adf-shir-us')]",
   "dependsOn": [],
   "properties": {
       "type": "SelfHosted",
       "typeProperties": {
           "linkedInfo": {
               "resourceId": "Resource ID obtained when you share the self-hosted runtime in DEV",
               "authorizationType": "Rbac"
           }
       }
   }
}
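A couple of notes on the snippet: mtx-adf-shir-us is the name of my shared runtime, so substitute the name of the runtime in your environment, and replace the resourceId placeholder with the actual resource ID that ADF displays when you share the self-hosted runtime in DEV.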

Solving Configuration Errors with ADF Self-hosted IR

You’ve set up the Azure Data Factory self-hosted integration runtime to access on-prem data sources. You create a linked service, click Test Connection, and get greeted with an error saying the security context can’t be passed. On the on-prem VM, you use the Integration Runtime Configuration Manager and get a similar error, or something to the effect that the JSON can’t be parsed. You spend a few hours trying everything that comes to mind, such as checking firewalls and testing connectivity from SSMS, but nothing helps.

How do we fix this horrible problem? We double the backslashes in the server name (if you use a named instance) and in the user name (after the domain) in the linked service properties. Apparently, the single backslash trips up the JSON definition behind the linked service (the backslash is the escape character in JSON), which explains the parsing error.
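To illustrate, here is a minimal sketch of what the underlying linked service JSON should end up looking like once the fix is applied. The server, instance, database, domain, and account names are made up, and you should verify the property names against the JSON of your own linked service; the point is simply that every backslash appears doubled in the definition:

{
   "name": "OnPremSqlLinkedService",
   "properties": {
       "type": "SqlServer",
       "typeProperties": {
           "connectionString": "Data Source=MYSERVER\\SQL2019;Initial Catalog=SourceDB;Integrated Security=True;",
           "userName": "MYDOMAIN\\svc-adf",
           "password": {
               "type": "SecureString",
               "value": "<password>"
           }
       },
       "connectVia": {
           "referenceName": "<your self-hosted integration runtime>",
           "type": "IntegrationRuntimeReference"
       }
   }
}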

Presenting Analytics in a Day Workshop on August 20th

In partnership with Microsoft, I’m delivering a complimentary, one-day, Analytics in a Day virtual workshop on August 20th, 9 AM – 5 PM Eastern Time. Targeting BI developers, architects and technology decision makers interested in achieving a single version of truth with organizational BI, this workshop is designed to guide and accelerate your journey towards a modern data warehouse to power your business with Azure Synapse, Azure Data Factory, Azure Data Lake, and Power BI.

The first half of the day, from 9 AM – 1 PM, will help you better understand how to:

  • Create an analytics solution that goes from data ingestion to insights using Azure Synapse Analytics and Power BI
  • Empower self-service analytics
  • Enable a truly data-driven culture in your business

Part of the workshop will be dedicated to hands-on training to help you get started on your cloud analytics journey.

The second half of the day, from 1 PM – 5 PM, is optional and will focus on best practices around designing Power BI + Azure Synapse Analytics BI solutions to enable “Discipline at the core and flexibility at the edge”.

Say goodbye to data silos. Analytics in a Day is designed to simplify and accelerate your journey towards a modern data warehouse to power your business. Reserve your seat today at http://bit.ly/aid202008.

ADF or SSIS, That Is the Question

Microsoft introduced Azure Data Factory (ADF) in 2015 to handle a specific scenario: tumbling window loads for Hadoop and other big data systems for internal MS usage. Then a need arose for a public-facing PaaS data integration platform, which led to an interesting dilemma: SSIS or something else? Since SSIS wasn’t designed to scale out in the cloud, ADF rose to the challenge. I could argue that a better approach would have been migrating the SSIS design experience to the cloud to retain existing skills and make thousands of BI developers feel at home, as Microsoft did with the SQL Server Database Engine and SSAS, but it’s a moot point now. So, ADF was incorrectly positioned as “SSIS for the Cloud”, and once that message made it out, Microsoft created a perception problem it has been fighting ever since. Like Azure ML, on the glory road to the cloud, things that were difficult with SSIS (installation, projects, deployment) became simple, and things that were simple became difficult. Naturally, Microsoft took a lot of criticism from customers and the community, including from your humble correspondent. ADF, of course, has nothing to do with SSIS, leaving many data integration practitioners with a difficult choice: should you take a risk on the road less traveled with ADF, or continue with the tried-and-true SSIS for data integration on Azure?

To Microsoft’s credit, ADF v2 has made significant enhancements in features, usability, and maintainability. There is also a “lift and shift” option to run SSIS inside ADF, but since this architecture requires a VM, I consider it a niche scenario, such as when you need to extend ADF with SSIS features that it doesn’t have. Otherwise, why would you start new development with SSIS hosted under ADF when you could provision and license the VM yourself and have full control over it?

SSIS has been evolving for 13 years (or 20 years if you count DTS), and it has a rich ecosystem of add-ons to tackle challenging data integration needs. ADF, on the other hand, has a humble 3-year history. It would probably take Microsoft another decade to get ADF where SSIS is today. Here are some limitations that you’ll probably run into:

  • ADF supports a limited set of connectors (even Excel is not on the list)
  • There is practically no extensibility, except for custom activities (which require .NET coding).
  • The ADF Execute Pipeline activity doesn’t support dynamic content for the pipeline name. This means you can’t dynamically bind the activity to execute pipelines (an ADF pipeline fulfills the same role as a package in the SSIS world) from a list stored in a database table. Although this sounds trivial, it prevents you from implementing a comprehensive ETL framework, which is important from an architectural standpoint, as any self-respecting data integration developer would agree. For instance, our SSIS ETL framework complements the SSIS catalog with features that SSIS doesn’t have, such as declarative parallelism, restartability, actual vs. target duration monitoring, logging, and others that I discussed in my “Is ETL (E)ating (T)hou (L)ive?” newsletter. Unfortunately, it’s not possible to migrate or implement such frameworks in ADF because of the above-mentioned Execute Pipeline restriction.
  • Continuing with limitations, ADF lags considerably behind SSIS in the areas of troubleshooting and monitoring. For example, if you misconfigure a previous step, you’ll get “Activity failed because an inner activity failed” during debugging, with no indication of what actually went wrong. This took hours to track down on a recent project! Of course, ADF is a cloud technology, so I hope the momentum behind it continues and the feature gap narrows until ADF is on par with SSIS.

As of now, consider Azure Data Factory for pure PaaS solutions (sometimes this is mandated by upper management and you don’t have much latitude). ADF would also be a good choice when your data integration loads need to scale, such as when you run a lot of ETL jobs that might be difficult to scale on a VM running SSIS. If you decide on ADF, simplify your data integration pipeline as much as possible, and strongly consider the ELT pattern instead of ETL (see that newsletter again if you don’t know what that means). Although all Microsoft cloud architecture diagrams prominently show ADF, my personal preference for cloud data integration projects is still SSIS on a VM with the Azure Feature Pack.

I’d love to hear your feedback.