Is ETL (E)ating (T)hou (L)ive?
Before we get to the subject of this newsletter, I’m happy to announce the availability of my latest class – Applied Power BI Service. As you’ve probably heard by now, Power BI Service (or Power BI 2.0) is the Microsoft latest cloud-based analytics service with a simple promise: 5 seconds to sign up, 5 minutes to wow! If you’re ready to disrupt how your organization is analyzing data, please contact me to schedule this class at your convenience and get immediate value.
What Not To Do
Back to the newsletter subject, let’s start with a story that was inspired by true events as they say in the movies. Not a long time ago, a man wanted a nice and modern house for his family. He hired a couple of well-known builders but they didn’t deliver what the man really wanted. Then, the man hired a third builder who built him a great house (or a close approximation of the grand vision). Everyone was happy and they lived happily ever after…or at least until the man sold the house to another man.
The second owner had more pressing needs and another vision about the house. Not only the house had to accommodate his family but now the house had to entertain hordes of guests so it had to be expanded. But to cut down cost, the second man decided to maintain the house on his own or outsource whatever he can’t do to a cheap builder. The new owner put hastily new rooms and did other renovations as necessary. Expansion and new construction were his highest priorities and there was never time for proper maintenance or to reinforce the house infrastructure so that it can accommodate the new demands. Needless to say, not much time had passed until the infrastructure gave up. For example, it took days for clogged pipes to drain and guests were not happy. Did I mention the man sold his guests the sun and the moon?
What does this have to do with Extraction, Transformation, and Loading (ETL)? Data is rapidly growing nowadays while ETL processing windows are shrinking. You must do more with less. And, ETL usually becomes a performance bottleneck that stands in the way of your current and future BI initiatives
What To Do
How did the story end? The story didn’t end and it will never end. If you have a house, you can just focus on renovations and additions. You must also maintain it and you must budget for it. One day a member of the man’s family did something out of ordinary and the entire infrastructure collapsed. There wasn’t a way to find out why and the family was scurrying around trying to apply quick fixes. Finally, the second man hired hastily the original builder to assess the situation. Among other things that the builder did to resolve the crisis, he recommended changes and proactive maintenance along the following ten tenets:
- Parallelism – The chances are that you have an ETL framework that orchestrates package execution, log errors, etc. And, the chances are that the framework executes packages sequentially. With all the bandwidth modern servers have, there is no excuse if your framework doesn’t support parallel execution. That’s because many ETL tasks, such as ODS loads, loading dimensions and independent fact tables, can benefit greatly from parallel execution. For example, at Prologika we use ETL framework that supports a configurable number of parallelism. Once you configure which packages can run in parallel, the framework distributes the packages across parallel flows.
- Incremental extraction – If you have small data volumes, you might get away with fully loading the source data but most systems would require incremental extraction. Again, this is something the ETL framework is best suited to handle.
- Volume stats – ETL must log in important data volume metrics, such as number of rows extracted, inserted, updated, and deleted. It should also load how many days were processed since the last incremental extraction and additional context that might be useful for troubleshooting purposes, such as what parameters were passed to stored procedures.
- Targeted execution – I recommend you add a target execution duration for each package. Then, ETL will log in the actual duration so that you can detect performance deviations from the norm.
- Daily monitoring – I suggest you implement and publish a dashboard, such as using Excel Power Pivot, and monitor this dashboard daily. For example, the dashboard should include a Package Execution KPI that flags excessive executions in red based on the performance metrics you established in step 4.
- Regression analysis – Once things “normalize”, create an one-time Extended Events session (assuming SQL Server) to capture the query plans for all significant queries. If during daily monitoring you discover a performance deviation, run the session again focusing on that slow package and compare the query plan with the baseline. Analyze both query plans to find if and why they have changed. To make this easier, when SQL Server 2016 ships, consider upgrading to take advantage of the new Query Store feature.
- Cold data archiving – If you lots of source data, e.g. billions of rows, consider archiving historical data that no one cares about, such as by uploading to Azure Table storage.
- Project deployment – Consider upgrading to SSIS 2012 or above to benefit from its project deployment so that you can get task-level performance analysis in addition to easier development.
- Avoid locking – Use “SET TRANSACTION ISOLATION LEVEL READ UNCOMMITED” at the beginning of your stored procedures of freeform SQL to avoid read locks. I prefer using this statement instead of the NOLOCK hint for its brevity and so that I don’t miss a table.
- ELT pattern – I saved the best for last. I’m a big fan of the ELT pattern. I usually try to get out as fast as I can from the SSIS designer. Instead of transformations in the ETL data flow, consider ETL pattern for its performance and maintenance benefits. For more information about the ELT pattern, read my blog “3 Techniques to Save BI Implementation Effort.
As you’d probably agree the BI landscape is fast-moving and it might be overwhelming. As a Microsoft Gold Partner and premier BI firm, you can trust us to help you plan and implement your data analytics projects.
President and Owner
Prologika, LLC | Making Sense of Data
Microsoft Partner | Gold Data Analytics
EVENTS & RESOURCES
Prologika: Applied Power BI Service training by Prologika (online or instructor-led):
Atlanta BI Group: Enhancing Data Analysis and Predictive Analytics with NoSQL by Cornell A. Emile on September 28th
Atlanta BI Group: ETL Architecture Reusable Design Patterns and Best Practices by Stephen Davis on October 26th