
Handling Data Quality and Data Enrichment

Some of the most difficult and tedious issues in almost every BI project are handling data quality issues (aka exceptions) and data enrichment. Data quality issues typically originate from incorrect data entries or violated business rules, such as misspelled or misclassified products. Data enrichment tasks may go beyond the design of the original data source, such as introducing new taxonomies for products or customers.

In a recent project, a client needed guidance on where to handle data quality and enrichment processes. Like cakes and ogres, a classic BI solution has layers. Starting upstream in the data pipeline and moving downstream, these are the data sources, the ODS (data staging and change data tracking), the EDW (star schema), and the semantic models.

In general, unless absolutely necessary, I’m against master data management (MDM) systems because of their complexity and cost. Instead, I believe that most data quality and enrichment tasks can be addressed efficiently without formal MDM. I’m also a big believer in tackling these issues as far upstream as possible, ideally in the data source.

Consider the following guidelines:

  1. There are different types of exceptions and enrichment requirements, and they vary in complexity and in where they should be resolved. The team must triage each case to determine how best to handle it, examining the data pipeline from upstream to downstream and traversing the layers.
  2. As a best practice, exception and enrichment tasks should be handled as far upstream as possible, because the further downstream a correction is made, the fewer consumers it reaches and the more implementation effort it might require.
    1. Data Source – examples include incorrect entries, such as a wrong employee job start date, or adding custom fields to implement categories required for standard hierarchies. The team should always start with the source system in an attempt to address data quality issues at the source.
    2. ODS – examples include corrections that cannot be made in the upstream layers but are needed by other consumers besides the EDW.
    3. EDW – examples include corrections applicable only to analytics, such as an IsAttendance flag that indicates which attendance events should be evaluated.
    4. Semantic layer – examples include calculated columns for data enrichment that are best done in DAX, and additional fields the end user needs in a composite model.
  3. Shortcuts – When resolving the change upstream is time-prohibitive, it should be permissible to handle it further downstream while waiting for the upstream system to change. For example, if fixing an issue in the data source is rather involved, a temporary fix can be made in the ODS to prototype the change and expedite it to consumers. However, the team should remove that fix once the source system is enhanced.
  4. When consumers prefer access to the uncorrected data, a pair of fields can be introduced as far upstream as possible (see the sketch after this list), such as:
    1. EmployeeId – raw field
    2. EmployeeIdCorrected – corrected field. Joins to the related dimension table will be done on this field.
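
To illustrate the raw/corrected field pair, here is a minimal T-SQL sketch. The table and column names (FactTimesheet, DimEmployee, HoursWorked) are hypothetical and only show the pattern: the corrected field, not the raw one, participates in the join to the dimension.

  -- Hypothetical fact table carrying both the raw and the corrected employee key
  CREATE TABLE dbo.FactTimesheet (
      TimesheetKey        INT IDENTITY(1,1) PRIMARY KEY,
      EmployeeId          INT NOT NULL,   -- raw value as received from the source
      EmployeeIdCorrected INT NOT NULL,   -- value after data quality corrections
      HoursWorked         DECIMAL(5,2) NOT NULL
  );

  -- Joins to the employee dimension use the corrected field,
  -- while the raw field remains available for auditing
  SELECT e.EmployeeName, SUM(f.HoursWorked) AS TotalHours
  FROM dbo.FactTimesheet AS f
  INNER JOIN dbo.DimEmployee AS e ON e.EmployeeId = f.EmployeeIdCorrected
  GROUP BY e.EmployeeName;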


Improving Data Quality

Poor data quality is one of the most challenging hurdles for successful data analytics. The story typically goes like this. The client has a flexible line-of-business system that people tend to use and abuse, such as entering garbage in free-form text fields that shouldn’t be free-form to start with. There is no data stewardship or data entry oversight. I’m not blaming anyone, because people go about their business trying to get something done as fast as possible and BI is not on their minds. The issue is further complicated by acquisitions, which usually bring more systems and more garbage. You and I have been there…the bigger and more complex the organization, the bigger the mess.

How do we solve this challenge? Should we buy a super-duper master data management (MDM) system to solve the data quality mess? Or should we kick the can down the road and tackle data quality in downstream systems, such as the data warehouse? In some cases MDM can help, but based on my experience, MDM systems are usually unnecessary. First, solving data quality issues down the road is probably a bad idea. Second, MDM systems bring additional complexity: ETL needs to be implemented to bring the data in and out, a business user must be delegated to correct data issues, another system needs to be learned and maintained, not to mention licensing costs. Instead, what’s really needed is to address data quality as far upstream as possible, and that means the main data sources, such as the ERP system. The process is simple but tedious. Here are a few steps to get your team started on that road:

  1. Start by cleaning up the hierarchies because they are fundamental for top-level analysis. Ask management to come up with a standardized nomenclature in the form of hierarchies. What do we want our product hierarchy (Product Category, Subcategory,…) to look like? What do we want the Customer hierarchy to look like? Geography? Organization?
  2. Identify major data quality issues that deviate from the standard nomenclature.
  3. For each issue, determine how to address it. Can it be fixed by overwriting the data entry or by adding a new custom field? Can the changes be automated or done in bulk, such as by updating the underlying table directly? For example, if the country is misspelled, the best solution is to go to that invoice or customer and change the country. Or, if a certain hierarchy member is completely missing, such as continent or territory, add a new custom field.
  4. Assign the list of data entry issues identified in step 2 to a data entry person to fix. Or, if they can be fixed in bulk, such as by updating the underlying tables, ask your helpful DBA to make the changes (see the sketch after this list).
  5. If possible, ask the developer in charge to modify the source system to improve data quality, such as by using dropdowns instead of free-form entries for important fields and adding data entry rules.
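
For step 4, a bulk fix can often be a simple UPDATE statement against the underlying table. The T-SQL sketch below is illustrative only; the table, column, and misspelled value are hypothetical, and the affected rows are reviewed before the change is applied.

  -- Review the affected rows first
  SELECT CustomerId, Country
  FROM dbo.Customer
  WHERE Country = 'Untied States';

  -- Then correct them in bulk
  UPDATE dbo.Customer
  SET Country = 'United States'
  WHERE Country = 'Untied States';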

I know this might be too simplistic. For example, the ERP might disable data entry once an invoice is approved, and I’ve seen clients add data correction tables in data marts to get around this. If updating the data directly in the underlying tables is an option, this might be your best bet. However, if compliance or other regulations prevent this, then the data must be fixed outside the source system.

However, the main principle remains: data quality must be tackled as close to the source as possible. Improving data quality is your best investment and a prerequisite for effective data analytics.

Using DQS for Externalizing Data Validation Rules

Developers spend plenty of time coding for exceptions. The customer SSN must be 9 characters. The organization code must be four characters and constrained to specific values. The happy path is easy; coding for exceptions is tedious. Some exceptions are easy to catch, while others might require more complicated logic than the happy path itself. And it gets more difficult if the rules change often or your BI system needs to support adding new rules.
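
As an example of the hand-coded validation that piles up over time, here is a minimal T-SQL sketch of the SSN and organization code rules above. The table, the column names, and the list of allowed codes are hypothetical.

  -- Hand-coded validation rules of the kind you would rather externalize
  SELECT CustomerId,
         CASE
             WHEN LEN(SSN) <> 9 THEN 'SSN must be 9 characters'
             WHEN LEN(OrgCode) <> 4
               OR OrgCode NOT IN ('CORP', 'RETL', 'WHSL', 'GOVT')
                  THEN 'Organization code is not in the allowed list'
         END AS ValidationError
  FROM dbo.Customer
  WHERE LEN(SSN) <> 9
     OR LEN(OrgCode) <> 4
     OR OrgCode NOT IN ('CORP', 'RETL', 'WHSL', 'GOVT');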

This is where Data Quality Services (DQS) can help. To me, the biggest strength of DQS is externalizing rules. Instead of hardcoding rules in SQL or ETL, you can externalize them and even offload rule management to business users. Using the Data Quality Client tool, a business user can define simple rules and corrections, such as requiring the e-mail address to match a specific regular expression. Then, the ETL developer can integrate the data flow with DQS to separate valid and invalid rows. Moreover, DQS tells you exactly which rule is broken and provides a confidence factor for matching. This helps you avoid tedious coding and saves your company a lot of effort.

In its first release, DQS leaves substantial room for improvement, which I hope SQL Server 2016 will address, including:

  1. I’m sure that there is a solid academic theory behind the “domains” concept, but requiring the user to create composite domains for simple cross-field validations might be too much to ask. Instead, a better approach might be to have a hidden composite domain consisting of all fields.
  2. The lookup lists, such as “Value Is In”, cannot be data-driven, requiring you to hardcode lookup values.
  3. More advanced rules are needed, such as IF-THEN, without requiring composite domains.
  4. Very slow performance with the SSIS DQS task because it uses a single thread.
  5. Currently, DQS has no API so automation is not possible.

Announcing Prologika Validator

Data is your biggest asset and today’s currency, but data can be messy. You know it and we’ve seen it – many times. The messier the data, the bigger the headache. At Prologika, we believe that data quality issues should be addressed as early as possible and that the validation process should start with the source systems. You need a solution that detects data entry issues without ETL, custom validation code, and exhaustively hard-coded rules. This is why I’m excited to announce today the availability of Prologika Validator!

Available as a cloud service and an on-premises offering, Prologika Validator uses predictive analytics to find data anomalies and help you improve data quality. It analyzes historical data and discovers common data patterns and relationships. When the user attempts to save a data entry that is outside the norm, Prologika Validator can detect the outlier. Then, the source system can notify the operator about a potential data entry error. As a result, the data is cleaned at the source and wrong data doesn’t end up in downstream systems, such as a data warehouse.

Suppose your company buys products from vendors. The operator responsible for the order entry creates an order in the ERP system, such as Oracle Financials or Microsoft Dynamics. The system might not be capable of detecting a wrong field entry or a wrong combination of fields, such as when the user specifies a wrong unit of measurement for the purchased product. However, Prologika Validator can detect such outliers by predicting the likelihood that the order entry is outside the norm. The validation process is very fast, and the operator is notified immediately.
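
Prologika Validator’s internals aren’t shown here, but to make the idea concrete, the T-SQL sketch below illustrates the general concept of a history-based outlier check: an incoming product/unit-of-measure combination that is rare in past orders gets flagged. The table, the columns, and the 1% threshold are hypothetical and not the product’s actual logic.

  -- Conceptual illustration only (not Prologika Validator's actual algorithm)
  DECLARE @ProductId INT = 1001, @UnitOfMeasure NVARCHAR(10) = N'TON';

  SELECT CASE
             WHEN COUNT(*) * 100.0 /
                  NULLIF((SELECT COUNT(*) FROM dbo.OrderHistory
                          WHERE ProductId = @ProductId), 0) < 1.0
                 THEN 'Possible data entry error: unusual unit of measurement for this product'
             ELSE 'Entry is consistent with historical orders'
         END AS ValidationResult
  FROM dbo.OrderHistory
  WHERE ProductId = @ProductId
    AND UnitOfMeasure = @UnitOfMeasure;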

Sound interesting? Visit the Prologika Validator page and give it a try now. If you want to know how Prologika Validator can help you improve data quality, contact us to discuss your requirements and get a pricing proposal.