The narrow definition of data quality is that it’s about bad data – data that is missing or incorrect. A broader definition is that data quality is achieved when a business uses data that is comprehensive, consistent, relevant and timely. If you focus only on the narrow definition, you may be lulled into a false sense of security when, in fact, your efforts fall short. We will address several more misconceptions about data quality.
In order to fix a problem you have to recognize you have a problem. According to recent Gartner research, 25 percent of Fortune 1000 companies are working with poor quality data. The Data Warehousing Institute (TDWI) estimated that data quality problems cost U.S. businesses $600 billion each year. Regulatory initiatives such as Sarbanes-Oxley and Basel II dictate that companies must provide transparent data. But even with the documented high costs of poor data quality and the tight regulatory environment, many companies are turning a blind eye to their data quality problems. Why? Perhaps it is because of their mistaken belief that bad data is the only data quality issue they need to worry about.
A corollary to the above: to fix a problem you first have to take responsibility for it. That’s the rub. Taking responsibility is the biggest roadblock to dealing with data quality. In order to achieve a high level of quality, data has to be viewed from an enterprise and holistic perspective. Data may be correct within each data silo, but the information will not be consistent, relevant or timely when viewed across the enterprise. To make matters worse, you’ve got each report or analysis interpreting the data differently, so even when the numbers start off the same in each silo, the end results will not be consistent. Data is a corporate asset and has to be consistent across the entire corporation, not just within the business function or division where it originated.
Fixing implies that there was something wrong with the original data, and that you can fix it once and be done with it. In reality, the problem may have been not with the data itself, but rather with the way it was used. When you manage data you manage data quality. It’s an ongoing process. Data cleansing is not the answer to data quality issues. Yes, data cleansing does address some important data quality problems and offers a solid ROI, but it is only one element of the data quality puzzle. Too often the business purchases a data cleansing tool and thinks the problem is solved. In other cases, because the cost of data cleansing tools is high, a business may decide that it is too expensive to deal with the problem at all.
Data quality is a company problem that costs a business in many ways. Although IT can help address the problem of data quality, the business has to own the data and the business processes that create or use it. The business has to define the metrics for data quality – its completeness, consistency, relevancy and timeliness. The business has to determine the threshold between data quality and ROI. IT can enable the processes and manage data through technology, but the business has to define it. For an enterprise-wide data quality effort to be initiated and successful on an ongoing basis, it needs to be truly a joint business and IT effort.
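As a rough illustration of what business-defined metrics and thresholds might look like once IT implements them, here is a minimal Python sketch. The records, field names, the 90-day freshness window and the threshold values are all hypothetical placeholders. In practice the business would supply its own definitions.

```python
from datetime import date

# Hypothetical customer records; field names are illustrative only.
records = [
    {"customer_id": "C001", "email": "a@example.com", "updated": date(2024, 5, 1)},
    {"customer_id": "C002", "email": None, "updated": date(2024, 4, 2)},
    {"customer_id": "C003", "email": "c@example.com", "updated": date(2023, 11, 15)},
    {"customer_id": "C004", "email": "d@example.com", "updated": date(2024, 5, 20)},
]

def completeness(rows, field):
    """Share of rows where the field is populated."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)

def timeliness(rows, field, as_of, max_age_days):
    """Share of rows updated within max_age_days of the as_of date."""
    fresh = sum(1 for r in rows if (as_of - r[field]).days <= max_age_days)
    return fresh / len(rows)

# Thresholds are a business decision; these numbers are placeholders.
thresholds = {"completeness": 0.95, "timeliness": 0.70}

as_of = date(2024, 6, 1)
scores = {
    "completeness": completeness(records, "email"),
    "timeliness": timeliness(records, "updated", as_of, 90),
}

# Metrics that fall below the business-defined threshold.
failing = sorted(name for name, score in scores.items() if score < thresholds[name])
print(scores, failing)
```

The split of responsibilities is visible in the code: the functions are the part IT can build and run, while the `thresholds` dictionary is the part only the business can fill in.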
Data entry or operational systems are often blamed for data quality problems. Although incorrectly entered or missing data is a problem, it is far from the only data quality problem. Also, everyone blames their data quality problems on the systems that they sourced the data from. Although some of that may be true, a large part of the data quality issue is the consistency, relevancy and timeliness of the data. If two divisions are using different customer identifiers or product numbers, does it mean that one of them has the wrong numbers or is the problem one of consistency between the divisions? If the problem is consistency, then it is an enterprise issue, not a divisional issue. The long-term solution may be for all divisions to use the same codes, but that has to be an enterprise decision.
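The customer identifier example can be sketched in a few lines of Python. The two division extracts below are invented, and matching customers by normalized name is a deliberately naive assumption made only to keep the example short. The point is that both divisions hold correct data, yet the enterprise view is inconsistent.

```python
# Hypothetical extracts from two divisions; each is internally correct.
division_a = {"CUST-1001": "Acme Corp", "CUST-1002": "Globex Inc"}
division_b = {"A-77": "Acme Corp", "G-12": "Globex Inc", "I-05": "Initech LLC"}

def index_by_name(customers):
    """Invert an id -> name mapping into normalized name -> id."""
    return {name.strip().lower(): cid for cid, name in customers.items()}

a_by_name = index_by_name(division_a)
b_by_name = index_by_name(division_b)

# Same customer, different identifiers: a consistency problem,
# not a case of one division holding the "wrong" number.
conflicts = {
    name: (a_by_name[name], b_by_name[name])
    for name in sorted(a_by_name.keys() & b_by_name.keys())
    if a_by_name[name] != b_by_name[name]
}

# Customers known to only one of the two divisions.
only_in_one = sorted(a_by_name.keys() ^ b_by_name.keys())

print(conflicts)
print(only_in_one)
```

Neither division can resolve the `conflicts` list on its own. Deciding which code wins, or whether to maintain a cross-reference, is the enterprise decision described above.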
The larger issue is that you need to manage data from its creation all the way to information consumption. You need to be able to trace its flow from data entry, transactional systems, data warehouse, data marts and cubes all the way to the report or spreadsheet used for the business analysis. Data quality requires tracking, checking and monitoring data throughout the entire information ecosystem. To make this happen you need data responsibility (people), data metrics (processes) and metadata management (technology). (We’ll address how in a future column.)
In an ideal world, every report or analysis performed by the business exclusively uses data sourced from the data warehouse – data that has gone through data cleansing and quality processes and includes consistent interpretations such as profit or sales calculations. If everyone uses the data warehouse’s data exclusively and it meets your data quality metrics, then it is the single version of the truth.
However, two significant conditions lessen the likelihood that the data warehouse solves your data quality issues by itself. First, people get data for their reports and analysis from a variety of data sources – the data warehouse (sometimes there are multiple data warehouses in an enterprise), data marts and cubes (that you hope were sourced from the data warehouse). They also get data directly from systems such as ERP, CRM, and budgeting and planning systems that may themselves be sources for the data warehouse. In these cases, ensuring data quality in the data warehouse alone is not enough. Multiple data silos mean multiple versions of the truth and multiple interpretations of the truth. Data quality has to be addressed across these data silos, not just in the data warehouse.
Second, data quality involves the source data and its transformation into information. That means that even if every report and analysis gets data from the same data warehouse, if the business transformations and interpretations in these reports are different then there still are significant data quality issues. Data quality processes need to involve data creation; the staging of data in data warehouses, data marts, cubes and data shadow systems; and information consumption in the form of reports and business analysis. Applying data quality to the data itself and not its usage as information is not sufficient.
Ditto what I said for Misconception #4.
Ditto what I said for Misconception #4.
Yes, standardizing on BI tools can save money and may be a worthwhile project. But, don’t lose sight of the fact that the use of different BI tools is a symptom of a data quality problem, not the cause. If you pull the same data and implement the same transformations (formulas) in different BI tools you get the same results. The report, chart or dashboard may look a little different, but the numbers would be the same. The problem, therefore, is not that different BI tools are being used, but that each project implementing these tools built a different data mart or cube and then applied different formulas in their reports or analysis. Using the same BI tool in different projects that use different data with different transformations is still going to yield different results – and hence the data quality issues still remain. The cause of the data quality issues was the lack of consistency between the data used and data transformations, not the use of different BI tools.
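A small Python sketch makes the argument concrete. The order data and both profit definitions are hypothetical. They stand in for two projects that each applied their own formula. The two "tools" are simply two implementations of the same formula.

```python
# Hypothetical order lines shared by both reports.
orders = [
    {"revenue": 1000.0, "cost": 600.0, "shipping": 50.0, "returned": False},
    {"revenue": 500.0, "cost": 300.0, "shipping": 25.0, "returned": True},
    {"revenue": 800.0, "cost": 500.0, "shipping": 40.0, "returned": False},
]

def profit_finance(rows):
    """Assumed finance definition: exclude returns, subtract shipping."""
    return sum(
        r["revenue"] - r["cost"] - r["shipping"] for r in rows if not r["returned"]
    )

def profit_sales(rows):
    """Assumed sales definition: count every order, ignore shipping."""
    return sum(r["revenue"] - r["cost"] for r in rows)

# Two different "BI tools" implementing the same formula on the same data.
tool_a = profit_finance(orders)
tool_b = 0.0
for r in orders:
    if not r["returned"]:
        tool_b += r["revenue"] - r["cost"] - r["shipping"]

# One tool, two projects, two formulas.
finance_report = profit_finance(orders)
sales_report = profit_sales(orders)

print(tool_a == tool_b)              # same formula, same number
print(finance_report, sales_report)  # different formulas, different "profit"
```

Swapping the tool changes nothing here, while swapping the formula changes the answer. This is why standardizing the tool alone leaves the inconsistency in place.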
Data quality is defined as comprehensive, consistent, relevant and timely data for use by the business. Don’t shrug it off as an issue of bad data entry. Data needs to be addressed on an enterprise scale and in a holistic manner incorporating people, processes and technology.