You’re involved in a business intelligence project, making great progress creating reports and cubes for analysis for your business users. They’re excited about what they see and can’t wait until they have the system up and running.
But then something happens in user acceptance testing. Some “funny” data values appear, and you can’t aggregate the data or drill down correctly. You find the data problem and get around it, but immediately hit another anomaly. Pretty soon, the business users feel there is a severe data quality problem and your project goes from the stratosphere to crash-and-burn. What happened?
Many business intelligence projects fail simply because the project team was surprised by the data – specifically, they were surprised by data anomalies lurking in the source systems. Sometimes the problems can be blamed on data quality, but often there is no problem with the correctness of data. Rather, the problems lie with the data’s completeness or the way specific fields are being used. A field might be used for multiple purposes depending on the business process or division that updated it, but those multiple uses are not documented, so it ends up looking wrong.
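This kind of multi-use field often surfaces only when you look at the field's frequency distribution. The sketch below is a hypothetical illustration (the field name, values and divisions are invented, not from any real system): one division fills a status field with shipping statuses while another fills it with region codes, and a simple value count makes the mixed usage visible.

```python
from collections import Counter

# Hypothetical example: a "status_code" field that one division fills with
# shipping statuses and another division fills with two-letter region codes.
status_code = ["SHIPPED", "PENDING", "NW", "SHIPPED", "SE", "CANCELLED", "NW"]

# A frequency distribution exposes the mixed usage: status words and
# region codes appear side by side in the same column.
for value, count in Counter(status_code).most_common():
    print(value, count)
```

Nothing here is "incorrect" data in the quality sense; the values are valid for whichever division wrote them. But without documentation of both uses, any aggregation over this field will look wrong.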
Why do teams get caught off guard? Chances are the project managers did source systems analysis, so there shouldn’t be any surprises, right? To answer this, let’s look at the usual approach. The team gathers the business requirements, determines the data needed to fulfill those requirements and identifies the system of record (SOR) from which to source that data. After that, the team assesses the SOR by obtaining field layouts, talking to the IT staff supporting the SOR and finally writing queries to fetch information about the SOR contents. The next step is to write the ETL code, test it and start loading the data warehouse. So far, so good. The data starts getting loaded and the business intelligence (BI) portion of the project – reports, cubes, dashboards, etc. – gets built. Then a funny thing happens when the business users start working in the BI environment: numbers don’t match what they have been reporting, and some “silly” results appear.
The problem is not that source system analysis was not undertaken; the problem is that it was not comprehensive. Examining source systems for data anomalies is a process most of us have little experience with, and most projects do not allow enough time to build all the SQL scripts needed to examine the data thoroughly and systematically.
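To make the scale of that scripting effort concrete, here is a small sketch of the kind of per-column profiling SQL a team would otherwise write by hand. The table and column names are assumptions for illustration; a real source system would have hundreds of columns, each needing similar checks.

```python
# Hypothetical sketch: generating the per-column profiling SQL that a team
# would otherwise write by hand. Table and column names are invented.
table = "orders"
columns = ["customer_id", "order_date", "status_code"]

queries = []
for col in columns:
    # For each column: row count, non-null count and distinct-value count,
    # which together reveal nulls and unexpected cardinality.
    queries.append(
        f"SELECT '{col}' AS col, COUNT(*) AS total, "
        f"COUNT({col}) AS non_null, COUNT(DISTINCT {col}) AS distinct_vals "
        f"FROM {table};"
    )

for q in queries:
    print(q)
```

Even this minimal version covers only three basic checks per column; range checks, pattern checks and cross-table relationship checks multiply the script count quickly, which is why the time is rarely granted.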
Putting an End to Funny Data
What’s a project manager to do? Start thinking about data profiling tools. To profile data effectively, you’ve got to understand the source system data – how it’s defined, its condition, and how it’s stored and formatted. When all the source systems were back-office, custom, legacy systems, this was a huge task requiring a lot of people and time. Widespread deployments of ERP (enterprise resource planning) systems replaced many of those customized legacy systems, resulting in a better understanding of data sources: with an ERP implementation, you either have documented data or implementation resource people who are familiar with the data. However, what if your company lacks knowledge or documentation of your source systems? That’s when you need data profiling software, so you can examine and document data sources rather than rely on a people-intensive investigation of them.
Data profiling tools automate a methodology and a set of algorithms that systematically examine your source systems. These tools generate far more queries than your staff could write by hand, examining data values, data ranges, frequency distributions, broken relationships and many other characteristics that indicate potential data anomalies. They encode the experience your staff probably lacks when it comes to deeply examining data for problems, and they help your staff truly understand not just what the data is supposed to represent, but more importantly, the real content of that data.
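The checks these tools automate can be illustrated with a minimal sketch. This is not any vendor's algorithm, just a toy version of three of the checks named above (null counts, value ranges and broken relationships), run over invented in-memory rows; real tools run equivalent checks as generated SQL against the source system itself.

```python
# Hypothetical rows standing in for a source table; real profiling runs
# against the source system, not in-memory data.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 25.0},
    {"order_id": 2, "customer_id": 99, "amount": None},   # missing value
    {"order_id": 3, "customer_id": 10, "amount": -5.0},   # suspicious range
]
customers = {10, 11, 12}  # known customer keys in the related table

amounts = [r["amount"] for r in orders if r["amount"] is not None]
profile = {
    "rows": len(orders),
    # Null counts: how complete is the column really?
    "null_amounts": sum(r["amount"] is None for r in orders),
    # Value ranges: a negative amount hints at a data anomaly.
    "amount_min": min(amounts),
    "amount_max": max(amounts),
    # Broken relationships: customer_ids with no matching customer row.
    "orphans": sorted({r["customer_id"] for r in orders} - customers),
}
print(profile)
```

Here the profile flags one null amount, a negative minimum and an orphaned customer key – exactly the sort of surprises that otherwise appear during user acceptance testing.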
Data profiling should be established as a best practice for every data warehouse, business intelligence and data migration project. Data profiling is essential to establishing a data quality process. In addition to performing data profiling during your project, it should be an ongoing activity to ensure that you maintain data quality levels.
Early detection is best: the further along in the process you are, the more expensive errors are to correct. Too many BI and DW projects have missed their deadlines or budgets because of data quality surprises late in the project. Data profiling benefits these projects by improving project planning and financial performance. It should be a requirement early in the project, just as system testing and user acceptance testing are requirements toward the project’s completion.
Get Serious About Data Profiling Tools
Unfortunately, data profiling is not yet widespread. Data profiling tools haven’t received the market exposure of dashboards or scorecards, which are splashy and get people excited; yet without solid data quality, the graphs on those dashboards are useless. People find it hard to get excited about a sales pitch for a tool that generates billions of queries to tell you how bad your data is. Nobody likes to be told their data has problems – especially when cool bar charts, scatter plots and red/yellow/green lights are on display in the new dashboard!
Data profiling should be established as a best practice and required for every data warehouse, BI or data migration project, and a data profiling tool should be considered to automate the process. You will not be successful if your data quality is suspect, so you must understand your data.