Data standards ensure the consistency, accuracy, integrity and validity that are critical to your success. Data is rarely clean and consistent, especially when it is gathered from many source systems. Reference data (referred to as “dimensions” in data warehousing or “master data” in enterprise resource planning systems) such as customers, suppliers, organizations and products generally varies between divisions and also changes over time. Standards for integrating these dimensions and keeping them consistent are crucial to creating the “single version of the truth.” Often, it is not mistakes in the data entered in the transaction systems that cause poor data quality (although errors do occur), but rather a gap in understanding how to merge data from different systems and maintain dimensional integrity.
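To make the dimension-integration problem concrete, here is a minimal sketch in Python, using invented system names, keys and customer records, of one common approach: a cross-reference table that maps each source system's customer identifier to a single conformed customer key, so data from both systems rolls up to the same master record.

```python
# Illustrative sketch: conforming customer reference data from two source
# systems into a single master ("conformed") dimension. System names,
# fields and keys are invented for the example.

# Customer records as each source system stores them.
orders_system = [
    {"cust_no": "C-101", "name": "Acme Corp.", "region": "NE"},
    {"cust_no": "C-205", "name": "Beta LLC",   "region": "SW"},
]
billing_system = [
    {"account_id": "9001", "customer_name": "ACME CORPORATION"},
    {"account_id": "9002", "customer_name": "Beta, LLC"},
]

# Cross-reference table: maps (source system, source key) to the master key.
# In practice this table is maintained as part of the data standards effort.
customer_xref = {
    ("orders",  "C-101"): "CUST-0001",
    ("billing", "9001"):  "CUST-0001",   # same customer, different source key
    ("orders",  "C-205"): "CUST-0002",
    ("billing", "9002"):  "CUST-0002",
}

def conform(source: str, source_key: str) -> str:
    """Return the master customer key for a source-system record."""
    return customer_xref[(source, source_key)]

print(conform("orders", "C-101"), conform("billing", "9001"))  # both CUST-0001
```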
The narrow definition of data quality is that it’s about data that is missing or incorrect. A broader definition is that data quality is achieved when a business uses data that is comprehensive, consistent, relevant and timely. If you focus only on the narrow definition, you may be lulled into a false sense of security when, in fact, your efforts fall short.
Data consistency is often a very difficult area to tackle and sometimes derails data warehouse projects. For instance, different divisions or business functions within a company may have different definitions of profitability. This does not have to cause problems, because each business function may legitimately measure profitability from its own perspective and position within the enterprise. Recognizing these differences and “labeling” them as different metrics will help you avoid the task of getting everyone to agree on one definition of profit. The key is to have consistent “raw” data and documented business rules. If the data is handled consistently and documented as it is transformed into information, different groups can apply different profitability rules yet trace their derivations back to the same “raw” data.
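As a small illustration of this point, the following Python sketch, with made-up figures and rule names, derives two differently labeled profitability metrics from the same raw transaction data; because the rules are explicit and the inputs are shared, either number can be traced back to the same source.

```python
# Illustrative sketch: two business functions apply different, documented
# profitability rules to the same "raw" transaction data. Figures and rule
# names are invented for the example.

raw_transactions = [
    {"revenue": 1200.0, "cost_of_goods": 700.0, "shipping": 60.0, "marketing": 90.0},
    {"revenue":  800.0, "cost_of_goods": 450.0, "shipping": 40.0, "marketing": 30.0},
]

def gross_margin(rows):
    """Sales view: revenue minus cost of goods only (documented rule GM-01)."""
    return sum(r["revenue"] - r["cost_of_goods"] for r in rows)

def contribution_margin(rows):
    """Finance view: also subtracts shipping and marketing (documented rule CM-01)."""
    return sum(
        r["revenue"] - r["cost_of_goods"] - r["shipping"] - r["marketing"]
        for r in rows
    )

# Same raw data, two clearly labeled metrics; no need to force one definition.
print("gross_margin:", gross_margin(raw_transactions))                # 850.0
print("contribution_margin:", contribution_margin(raw_transactions))  # 630.0
```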
A sample of best practices regarding data standards includes the following:
Develop a data model for your data warehouse, operational data store and data marts. You do not need to develop an enterprise data model at first. In fact, I would suggest that you avoid it unless you have solid management backing, your business does not need a return on investment anytime soon and you have a very long time to deliver results. The key reasons for a set of data models are that they establish a way to communicate and obtain feedback on what you are building, serve as a point of documentation and help with version control.
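Purely as an illustration of what such a model might capture, the sketch below uses Python dataclasses to outline a small dimensional model with a conformed customer dimension, a date dimension and a sales fact; the entity and attribute names are assumptions, not a prescribed design.

```python
# Illustrative sketch of a simple star-schema data model expressed as Python
# dataclasses. Entity and attribute names are invented for the example; a real
# project would capture this in a modeling tool and keep it under version control.
from dataclasses import dataclass
from datetime import date

@dataclass
class DimCustomer:          # conformed customer dimension
    customer_key: int       # surrogate key owned by the warehouse
    source_system: str      # lineage back to the system of record
    customer_name: str
    region: str

@dataclass
class DimDate:              # calendar dimension
    date_key: int           # e.g. 20240131
    calendar_date: date
    fiscal_period: str

@dataclass
class FactSales:            # grain: one row per order line
    customer_key: int       # foreign key to DimCustomer
    date_key: int           # foreign key to DimDate
    quantity: int
    revenue: float
```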
Develop a data model for your data staging areas and document the processes. A lot of important data integrity, data quality and data transformation work happens within your data staging areas. If you do not model and document the data processing that occurs, how do you achieve the data transparency and consistency that your business needs? Documenting the process should be an easy task if you are using an extract, transform and load (ETL) tool. Creating the data model may not be as straightforward because the data structures are usually defined by the ETL developers as they write their code. People may think that documenting the processes and data models will slow their development efforts; however, taking shortcuts is not acceptable in this case.
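As a small, hedged example of the kind of documentation involved, the Python sketch below wraps each staging transformation so that a plain-language description and row counts are recorded as the step runs; the step names and data are invented, and an ETL tool would normally capture this metadata for you.

```python
# Illustrative sketch: self-documenting staging transformations. Each step
# records what it did and how many rows it touched, so the staging process
# is transparent. Step names and data are invented for the example.
process_log = []

def run_step(name, description, rows, transform):
    """Apply a transformation to staged rows and log what happened."""
    out = transform(rows)
    process_log.append({
        "step": name,
        "description": description,
        "rows_in": len(rows),
        "rows_out": len(out),
    })
    return out

staged = [{"cust": " acme ", "amount": "100"},
          {"cust": "beta",   "amount": None}]

staged = run_step("trim_names", "Strip whitespace from customer names",
                  staged, lambda rows: [{**r, "cust": r["cust"].strip()} for r in rows])
staged = run_step("drop_null_amounts", "Remove rows with no amount",
                  staged, lambda rows: [r for r in rows if r["amount"] is not None])

for entry in process_log:
    print(entry)
```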
Define and track data quality metrics and issues. You need to determine how you will measure data quality, working with the business users to understand how they define it. The data quality measures need to be instantiated and enforced in your ETL processing. In addition, create reports or dashboards to track your performance against those measures. You also need to examine the quality of the source systems. Data profiling tools are an effective way both to model and to assess the state of data quality in the systems of record (SOR).
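One simple way such metrics might be computed and tracked is sketched below in Python, measuring completeness and validity against a target; the field names, rules and threshold are assumptions made for illustration.

```python
# Illustrative sketch: computing basic data quality metrics (completeness and
# validity) for a staged customer feed and flagging any metric below a target.
# Field names, rules and thresholds are invented for the example.
records = [
    {"customer_id": "C-101", "email": "ap@acme.example", "country": "US"},
    {"customer_id": "C-205", "email": None,              "country": "US"},
    {"customer_id": None,    "email": "b@beta.example",  "country": "ZZ"},
]

VALID_COUNTRIES = {"US", "CA", "MX"}
TARGET = 0.90  # threshold agreed with the business users

def completeness(rows, field):
    """Share of rows where the field is populated."""
    return sum(1 for r in rows if r[field] is not None) / len(rows)

def validity(rows, field, allowed):
    """Share of populated rows whose value is in the allowed set."""
    populated = [r for r in rows if r[field] is not None]
    return sum(1 for r in populated if r[field] in allowed) / len(populated)

metrics = {
    "customer_id completeness": completeness(records, "customer_id"),
    "email completeness": completeness(records, "email"),
    "country validity": validity(records, "country", VALID_COUNTRIES),
}

for name, value in metrics.items():
    status = "OK" if value >= TARGET else "BELOW TARGET"
    print(f"{name}: {value:.0%} ({status})")
```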
Data cleansing. If applicable to your enterprise, you need to develop data standardization rules and processes across the different source systems. The key is consistency, not developing the “perfect” definition. You may be able to use ETL tools to implement your standardization processes, but you might also need to investigate additional tools. This is one area where you need to examine the tools’ capabilities to determine how much data cleansing you can actually implement or afford. Although it is possible to standardize name and address matching, as well as the householding processes used in the financial industry to associate family members and small businesses, you need to decide whether doing so brings a sufficient return on investment. This is one area where expectations need to be matched with reality.
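To give a feel for the kind of standardization rules involved, the Python sketch below normalizes company names with a small abbreviation map and builds a crude matching key; the rules are invented and deliberately simplistic, and dedicated cleansing tools go far beyond this.

```python
# Illustrative sketch: simple name standardization and matching. The rules
# below are invented and far cruder than what a dedicated data cleansing
# tool provides; the point is consistent rules, not perfect ones.
import re

SUFFIX_MAP = {"corporation": "corp", "incorporated": "inc", "limited": "ltd",
              "llc": "llc", "corp": "corp", "inc": "inc", "ltd": "ltd"}

def standardize_name(raw: str) -> str:
    """Lowercase, strip punctuation, and normalize common company suffixes."""
    tokens = re.sub(r"[^\w\s]", "", raw.lower()).split()
    return " ".join(SUFFIX_MAP.get(t, t) for t in tokens)

def match_key(raw: str) -> str:
    """Crude matching key: the standardized name with suffixes dropped."""
    return " ".join(t for t in standardize_name(raw).split()
                    if t not in SUFFIX_MAP.values())

# Two source systems, two spellings, one match key.
print(standardize_name("ACME Corporation"))   # acme corp
print(standardize_name("Acme Corp."))         # acme corp
print(match_key("ACME Corporation") == match_key("Acme, Corp."))  # True
```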
Data naming and definitions. Your project should have standard enterprise-wide names and abbreviations, including industry-specific and company-specific acronyms. These standards can save you time in creating systems, improve communications and reduce maintenance time. In the real world, it is not unusual to find that other IT systems either did not use a naming convention or that each project team created its own. This means that the chances of having all IT systems follow one naming convention are limited. If a set of standards exists, use it and do not worry about making it perfect. If standards do not exist, create them quickly and enforce them. Consistency, not perfection, is important in achieving long-lasting data quality.
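As a minimal sketch of enforcing such a convention, the Python example below checks proposed column names against a small approved-token list and a lower_snake_case pattern; the tokens and the pattern are assumptions, not an established standard.

```python
# Illustrative sketch: checking proposed column names against a naming
# standard. The approved tokens and the pattern are invented for the example;
# the real lists would come from the project's naming standard.
import re

APPROVED_TOKENS = {"cust", "addr", "amt", "dt", "id", "nbr", "cd",   # abbreviations
                   "name", "line", "order", "ship", "date"}          # whole words
NAME_PATTERN = re.compile(r"^[a-z]+(_[a-z0-9]+)*$")                  # lower_snake_case

def check_column_name(name: str) -> list[str]:
    """Return a list of problems with a proposed column name (empty if OK)."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append("does not follow lower_snake_case")
    for token in re.findall(r"[a-z]+", name.lower()):
        if token not in APPROVED_TOKENS:
            problems.append(f"'{token}' is not in the approved name/abbreviation list")
    return problems

print(check_column_name("cust_addr_line1"))   # []
print(check_column_name("CustomerAddr"))      # pattern and token problems
print(check_column_name("cst_amt"))           # 'cst' is not approved
```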
All the latest and greatest tools in the world will not mask poor data quality. The old GIGO (garbage in, garbage out) expression aptly describes what happens to data warehouse and business intelligence projects that do not establish comprehensive data standards. These standards are like building codes: you might be able to get away with violating them because no one immediately sees you doing so; however, your actions will come back to bite you in the long run.