Posts

Improving Data Quality

Poor data quality is one of the most challenging hurdles for successful data analytics. The story typically goes like this. The client has a flexible line of business system that people tend to use and abuse, such as entering garbage in free-form text fields that shouldn’t be free-form to start with. There is no data stewardship or data entry oversight. I’m not blaming anyone because people go about their business trying to get something done as fast as possible and BI is not on their minds. The issue is further complicated by acquisitions that usually bring more systems and more garbage. You and I’ve been there…the bigger and more complex the organization, the bigger the mess.

How do we solve this challenge? Should we buy a super-duper master data management (MDM) system to solve the data quality mess? Or should we kick the can down the road and tackle data quality in downstream systems, such as the data warehouse? In some cases, MDM can help but based on my experience, MDM systems are usually unnecessary. First, solving data quality issues down the road is probably a bad idea. Second, MDM systems bring additional complexity. ETL needs to be implemented to bring the data in and out, a business user must be delegated to correct data issues, another system needs to be learned and maintained, not to mentioned licensing costs. Instead, what’s really needed is to address data quality as upstream as possible, and that is the main data sources, such as the ERP system. The process is simple but tedious. Here are a few steps to get your team started on that road:

  1. Start by cleaning up the hierarchies because they are fundamental for top-level analysis. Request Management to come up with a standardized nomenclature in the form of hierarchies. What do we want our product hierarchy (Product Category, Subcategory,…) to look like? What do we want the Customer hierarchy to look like? Geography? Organization?
  2. Identify major data quality issues that deviate from the standard nomenclature.
  3. For each issue, determine how to address the issue. Can it be fixed by overwriting the data entry or coming up with a new custom field? Can the changes be automated or done in bulk, such as updating the underlying table directly? For example, if the country is misspelled, the best solution will be to go that invoice or customer and change the country. Or, if a certain hierarchy member is completely missing, such as continent or territory, add a new custom field.
  4. Assign the list of the data entry issues identified in step 2 to a data entry person to fix. Or, if they can be fixed in bulk, such as by updating the underlying tables, ask your helpful DBA to make the changes.
  5. If possible, ask the developer in charge to modify the source system to improve data quality, such as by using dropdowns instead of free-form entries for important fields and adding data entry rules.

I know this might be too simplistic. For example, the ERP might disable data entries once an invoice is approved, and I’ve seen clients adding data correction tables in data marts to get around this. If updating the data directly in the underlying tables is an option, this might be your best bet. However, if compliance or other regulations prevent this from happening, then the data must be fixed outside.

However, the main principle remains that data quality must be tackled as close to the source as possible. And improving data quality is your best investment and prerequisite to enable effective data analytics.

Using DQS for Externalizing Data Validation Rules

Developers spend plenty of time coding for exceptions. The customer SSN must be 9 characters. The organization code must be four characters and it must be constrained to specific values. The happy path is easy, coding for exceptions is tedious. Some exceptions are easy to catch, others might require more complicated logic than implementing the happy path. And, it gets more difficult if the rules change often or your BI system needs to support adding new rules.

This is where Data Quality Services (DQS) can help. To me, the biggest strength of DQS is externalizing rules. Instead of hardcoding rules in SQL or ETL, you can externalize them and even offload rule management to business users. Using the Data Quality Client tool, a business user can define simple rules and corrections, such as that the e-mail address must match a specific regular expression. Then, the ETL developer can integrate the data flow with DQS to separate invalid and valid rows. Moreover, DQS will tell you exactly which rule is broken and a confidence factor for matching. This helps you avoid tedious coding and saves your company a lot of effort.

In its first release, DQS has left substantial areas for improvement which I hope SQL Server 2016 will address, including:

  1. I’m sure that there is a solid academic theory behind the “domains” concept, but requiring the user to create composite domains for simple-cross field validations might too much to ask. Instead, a better approach might be to have a hidden composite domain consisting of all fields.
  2. The lookup lists, such as “Value Is In” cannot be data-driven, requiring you to hardcode lookup values.
  3. More advanced rules are needed, such as IF-THEN, without requiring composite domains.
  4. Very slow performance with the SSIS DQS task because it uses a single thread.
  5. Currently, DQS has no API so automation is not possible.

Announcing Prologika Validator

Data is your biggest asset and today’s currency but the data can be messy. You know it and we’ve seen it – many times. The messier the data, the bigger the headache. At Prologika, we believe that data quality issues should be addressed as early as possible and the validation process should start with the source systems. You need a solution to detect data entry issues without ETL, custom validation code, and exhaustively hard-coded rules. This is why I’m excited to announce today the availability of Prologika Validator!

Available as a cloud service and on-premises offering, Prologika Validator uses predictive analytics to find data anomalies in order to help you improve data quality. It analyzes historical data and discovers common data patterns and relationships. When the user attempts to save a data entry that is outside the norm, Prologika Validator can detect the outlier. Then, the source system can notify the operator about a potential data entry error. As a result, the data is cleaned at the source and wrong data doesn’t end up in downstream systems, such as a data warehouse. Suppose your company buys some products from vendors. The operator responsible for the order entry creates an order in the ERP system, such as Oracle Financials or Microsoft Dynamics. The system might not be capable of detecting a wrong field entry or a wrong combination of fields, such as the user has specified a wrong unit of measurement for the product purchased. However, Prologika Validator is capable of detecting such outliers by predicting the likelihood that the order entry is outside the norm. The validation process is very fast and the operator is notified immediately.

Sounds interesting? Visit the Prologika Validator page and give it a try now. If you want to know how Prologika Validator can help you improve data quality, contact us to discuss your requirements and get a pricing proposal.