In “The Trial-and-Error Method for Data Architecture,” I discussed some common data architecture mistakes. In this article, I point out areas where companies tend to go wrong in data integration or extract, transform and load (ETL) processing.
The usual development approach for data integration is to gather the data requirements, determine what data is needed from source systems, create the target databases such as a data warehouse and then code. This is an incomplete, bottom-up approach. It needs to be coupled with a top-down approach that emphasizes an overall data integration architecture and workflow.
Some of the design considerations often lost with a bottom-up approach include:
Failing to address these considerations in the beginning can delay your project, increase costs, reduce perceived data quality and cause business users to question the value of your business intelligence (BI)/data warehousing (DW) efforts.
People often assume that data quality problems are simply data errors or inconsistencies in the transactional systems that can be fixed with data quality products. They overlook, and therefore don’t try to prevent, the problems that arise when you integrate data from disparate source systems into a data warehouse.
They’re not seeing that many data quality problems are really data consistency and integrity issues that arise when you integrate data from multiple transaction systems. Differences in dimensional data, such as products (part IDs, codes and hierarchies), customers (businesses and/or people), suppliers, partners and employees, become an issue at the enterprise level, which is precisely the target for DW and performance management reporting and analysis.
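To make that concrete, consider part IDs that differ across two transactional systems. The following is a minimal sketch in Python of the kind of cross-reference logic that conforms them into a single product dimension; the system names, ID formats and field names are hypothetical, not drawn from any particular product or project.

```python
# A minimal sketch (not production code): conforming product identifiers from
# two hypothetical source systems into one product dimension for the warehouse.
# System names, ID formats and field names are illustrative assumptions.

erp_products = [
    {"part_id": "A-100", "description": "Widget, large"},
    {"part_id": "B-200", "description": "Gasket"},
]
order_products = [
    {"sku": "0000A100", "desc": "WIDGET LG"},   # same item as A-100, different code format
    {"sku": "0000C300", "desc": "BRACKET"},     # exists only in the order system
]

def normalize(raw_id: str) -> str:
    """Strip punctuation and leading zeros so IDs from both systems can match."""
    return raw_id.replace("-", "").lstrip("0").upper()

conformed_keys = {}   # natural key -> surrogate key
product_dim = []      # conformed product dimension rows
next_key = 1

for source, rows, id_field, desc_field in [
    ("ERP", erp_products, "part_id", "description"),
    ("ORDERS", order_products, "sku", "desc"),
]:
    for row in rows:
        natural = normalize(row[id_field])
        if natural in conformed_keys:
            continue  # already conformed from another system
        conformed_keys[natural] = next_key
        product_dim.append({
            "product_key": next_key,
            "natural_key": natural,
            "first_seen_in": source,
            "description": row[desc_field],
        })
        next_key += 1

for row in product_dim:
    print(row)
```

The point is not the particular matching rule but that this reconciliation is an integration-layer responsibility, exactly the kind of work a bottom-up, source-by-source approach tends to miss.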
Even when data quality problems are not your fault, you still need to take responsibility to proactively measure, monitor and report on data quality metrics as you load your DW and data marts. You might not own fixing the problems, but you certainly own measuring them. The business should not make decisions using enterprise data of unknown quality.
Measure data quality by obtaining data quality requirements in your initial business requirements phase, incorporating data quality metrics into your architecture, monitoring those metrics in all your data integration processes and reporting on data quality so the business users understand the data on which they are basing their decisions.
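As a rough illustration of the monitoring step, a load process can compute a few basic metrics (completeness, validity and referential integrity) and record them with each load. The field names, rules and audit output below are hypothetical assumptions, not a prescribed set of metrics:

```python
# A minimal sketch (assumed field names and rules): computing basic data
# quality metrics while loading rows into a warehouse fact table.

from datetime import date

incoming_rows = [
    {"customer_id": "C1", "order_date": date(2010, 5, 1), "amount": 120.0},
    {"customer_id": None, "order_date": date(2010, 5, 2), "amount": 75.5},
    {"customer_id": "C9", "order_date": None, "amount": -10.0},
]

known_customers = {"C1", "C2", "C3"}   # keys already in the customer dimension

def measure_quality(rows):
    """Count completeness, validity and referential-integrity defects."""
    total = len(rows)
    metrics = {
        "row_count": total,
        "missing_customer_id": sum(1 for r in rows if r["customer_id"] is None),
        "missing_order_date": sum(1 for r in rows if r["order_date"] is None),
        "negative_amount": sum(1 for r in rows if r["amount"] < 0),
        "unknown_customer": sum(
            1 for r in rows
            if r["customer_id"] is not None and r["customer_id"] not in known_customers
        ),
    }
    rows_with_issue = sum(
        1 for r in rows
        if r["customer_id"] is None
        or r["order_date"] is None
        or r["amount"] < 0
        or r["customer_id"] not in known_customers
    )
    metrics["rows_with_issue"] = rows_with_issue
    metrics["issue_rate"] = rows_with_issue / total if total else 0.0
    return metrics

# In a real pipeline these metrics would be written to an audit table and
# surfaced in BI reports so users see the quality of the data they rely on.
for name, value in measure_quality(incoming_rows).items():
    print(f"{name}: {value}")
```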
While most large enterprises have embraced ETL development as a best practice, the reality is that custom coding is still prevalent, especially in the following areas:
In all the cases just mentioned, I get the usual feedback: “Rick, how can it be faster to code with the ETL tool, and how can the ETL code perform as well as the code I create?” Without debating the merits of custom coding versus ETL development, let’s just say that when I see a large block of custom code, I see a great opportunity for change and improvement. But custom coding often stays below the radar because most people simply assume that everyone is using ETL tools.
In general, people tend to take an oversimplified view of data integration. By not seeing it as a process to be managed from a top-down perspective, they tend to run into problems and learn the hard way – through trial and error.