In the following article I point out where companies tend to go wrong in their initial use of extract, transform and load (ETL).
I see it all the time when reviewing an enterprise’s data integration architecture: data quality processing that’s a tangled mess or totally nonexistent. Why is this common? People either fail to do data profiling to understand the true extent of data quality issues in the source systems, or they assume any data quality issues are the sources’ problem, not theirs.
The latter is a weak excuse. Data quality is everyone’s problem. You need to gather the business requirements for data quality metrics, profile your data sources to create a baseline, build the metrics into your ETL processes, monitor them, report on the data quality levels and, finally, work with the business users when corrective action is needed.
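To make that concrete, here is a minimal profiling-and-monitoring sketch in plain Python with pandas, not any particular ETL tool’s API. The source file, column names, null-rate limit and the 10% row-count tolerance are hypothetical examples; real thresholds should come from the business requirements.

```python
# Minimal data-profiling sketch: capture a baseline of simple quality metrics
# for a source table, then compare each ETL run against it.
# The file name, columns and thresholds below are hypothetical examples.
import pandas as pd

def profile_source(df: pd.DataFrame) -> dict:
    """Return a baseline of simple data quality metrics for one table."""
    return {
        "row_count": len(df),
        "null_rate": df.isna().mean().round(4).to_dict(),  # fraction of nulls per column
        "duplicate_rows": int(df.duplicated().sum()),
    }

def check_against_baseline(metrics: dict, baseline: dict,
                           max_null_rate: float = 0.05) -> list[str]:
    """Compare the current run's metrics to the agreed thresholds."""
    issues = []
    for column, rate in metrics["null_rate"].items():
        if rate > max_null_rate:
            issues.append(f"{column}: null rate {rate:.1%} exceeds {max_null_rate:.0%}")
    if metrics["row_count"] < 0.9 * baseline["row_count"]:
        issues.append("row count dropped more than 10% versus baseline")
    return issues

if __name__ == "__main__":
    source = pd.read_csv("customers.csv")      # hypothetical source extract
    baseline = profile_source(source)          # store this after the first run
    print(check_against_baseline(profile_source(source), baseline))
```

The point is not the code itself but the habit: the metrics are measured on every load, compared to a baseline and reported, rather than discovered by a business user months later.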
I’ve found that many enterprises’ first experience with ETL tools is simply to use them to execute SQL code or stored procedures. Using the ETL tool as a scheduler for custom SQL scripts or procedures wastes an essential tool in your data warehouse (DW)/business intelligence (BI) implementation. With this approach you get none of the benefits of the ETL tool – no workflow, no data lineage, no reuse and no productivity boost – and you have probably slowed yourself down by writing the custom SQL code in the first place.
Although they’re using the tool, these firms don’t understand ETL best practices or the data integration paradigm the tool makes possible. The first inclination when developing source-to-target mappings, especially if you come from the custom SQL coding world, is to try to do everything in one operation.
The reality is that a stored procedure is a series of steps contained in one physical program. The ETL workflow is the analogous container, but it can do much more. Reuse of common error and exception processing, conditional logic and parallel processing are all examples of what you can build into that ETL workflow.
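As an illustration of that container idea, here is a minimal sketch in plain Python of what a workflow provides – shared exception handling, parallel extraction and a conditional branch. A real ETL tool gives you these declaratively; the step names and functions below are hypothetical.

```python
# Sketch of what an ETL workflow container provides versus one monolithic
# stored procedure: reusable exception handling, conditional branching and
# parallel steps. Plain Python stand-in; step names are hypothetical.
from concurrent.futures import ThreadPoolExecutor
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl_workflow")

def run_step(name, func, *args):
    """Shared error/exception handling reused by every step in the workflow."""
    try:
        log.info("starting %s", name)
        return func(*args)
    except Exception:
        log.exception("step %s failed; routing to exception handling", name)
        raise

def extract_orders():    return [{"id": 1, "amount": 120.0}]
def extract_customers(): return [{"id": 7, "region": "EMEA"}]
def transform(orders, customers): return [{**o, "source": "orders"} for o in orders]
def load(rows):          log.info("loaded %d rows", len(rows))

if __name__ == "__main__":
    # Parallel extraction of independent sources.
    with ThreadPoolExecutor() as pool:
        orders_f = pool.submit(run_step, "extract_orders", extract_orders)
        customers_f = pool.submit(run_step, "extract_customers", extract_customers)
        orders, customers = orders_f.result(), customers_f.result()

    # Conditional logic: only transform and load when there is new data.
    if orders:
        load(run_step("transform", transform, orders, customers))
    else:
        log.info("no new orders; skipping load")
```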
Also, when you build custom code that the ETL workflow merely invokes, that code is a “black box” to the workflow. If the same processing were created using the ETL tool, the process would be documented, repeatable and visible through the tool’s data lineage and metadata management capabilities.
When any of your enterprise applications transfers or loads data, there are always (or should be) operational controls for how that processing is defined, implemented and monitored. But too often, the same level of operational control is not applied to the initial loading of your DW or the subsequent data marts.
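For a sense of what those controls look like when they are built into a load, here is a minimal sketch in Python. The validation rule, reject threshold and audit record are hypothetical examples of the kinds of controls an operations team would expect to see defined, implemented and monitored.

```python
# Sketch of operational controls applied to a DW load: reconcile row counts,
# enforce a reject threshold and produce an audit record for monitoring.
# The validation rule and threshold values are hypothetical examples.
from datetime import datetime, timezone

def load_with_controls(rows, target, reject_threshold=0.01):
    loaded, rejected = 0, 0
    for row in rows:
        if row.get("customer_id") is None:   # example validation rule
            rejected += 1
            continue
        target.append(row)
        loaded += 1

    total = loaded + rejected
    # Control 1: reconcile source count against loaded + rejected.
    assert total == len(rows), "row count reconciliation failed"
    # Control 2: fail the load if rejects exceed the agreed threshold.
    if total and rejected / total > reject_threshold:
        raise RuntimeError(f"reject rate {rejected / total:.1%} above threshold")
    # Control 3: audit record for monitoring and sign-off.
    return {"run_at": datetime.now(timezone.utc).isoformat(),
            "source_rows": len(rows), "loaded": loaded, "rejected": rejected}

if __name__ == "__main__":
    audit = load_with_controls([{"customer_id": 1}, {"customer_id": None}],
                               target=[], reject_threshold=0.5)
    print(audit)
```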
Why is this? Let’s take the DW and data mart as separate topics. For the DW, often the ETL developers were not from the “old guard” enterprise applications group, but rather a new crop of developers experienced in database development. These new ETL developers do not have experience in operational processing. In addition, no one from the business is asking for these controls because they don’t know they’re supposed to. When the DW does go into production and there are problems, everyone is surprised that these controls were not built in.
With regard to creating and loading data marts, there is often a much more relaxed approach to development. Many times data marts are built with custom code, but even when an ETL tool is used, data marts are built as if they were “sandboxes” (a testing environment that isolates untested code changes and outright experimentation from the production environment). Too often, they are not viewed as production applications that facilitate real business decisions – but they are! If you build data marts and businesspeople are using them for decision-making, then they need all the same operational control and ETL best practices as the data warehouse.
There are two views of ETL processing that can seem to be at odds. First, ETL processing is a paradigm shift from custom SQL coding. ETL tools offer many benefits to the business by establishing data services that can be tapped throughout the enterprise, and they offer benefits to IT through productivity gains. If you have bought the ETL tool, you need to change how you do things to get those benefits.
Second, the more things change, the more they stay the same. What does not change with the use of ETL tools is the need for operational processing best practices for your ETL procedures. These operational controls and practices have evolved over the years and should be leveraged in any new data integration processes. After all, implementing the highest quality applications and data is not something new.