Data preparation is the core set of processes for data integration. These processes gather data from diverse source systems, transform it according to business and technical rules, and stage it for later steps in its life cycle when it becomes information used by information consumers.
Don’t be lulled by the silver bullet a vendor or a consultant may try to sell you, saying that all you need to do is point a business intelligence (BI) tool at the source system and your business users will be off and running (supposedly with real-time data, too!).
That oversimplifies data integration into a connectivity issue alone. If it was that easy, every ERP vendor would be offering modules to do just that, rather than have you struggle for years to produce the reports that your business users are requesting. In addition, ERP vendors would not be continually evolving data warehouse (DW) solutions over the years if data integration was merely a connectivity problem.
The reason the solution is not just point and click (using the BI tool) is because data preparation involves many steps that may be very complex depending on how your business operates, as well as how your applications are implemented within your enterprise. Physically accessing the data in source systems may be easy, but transforming it into information is another story altogether (and something you don’t appreciate if you have only been dealing with proofs of concept or single business solutions rather than an enterprise-wide, real data solution).
The steps involved in data preparation are: gathering, reformatting, consolidating and validating, transforming (business rules and technical conversions), performing data quality checks and storing data. Figure 1 depicts the data preparation processes in a logical progression. Although the sequence is accurate, the physical implementation may involve a combination of many of these processes into a sequence of source-to-target mapping operations or workflows. Also, although the diagram implies that you’re storing intermediate results in a database, it’s only a logical representation. Intermediate mappings may involve the use of memory or temporary tables based on performance and efficiency considerations rather than persistent storage in database tables.
Figure 1: Data Preparation Processes
Data preparation processes easily capture the lion’s share of the work of any DW or BI project. Industry analysts estimate it consumes 60 to 75 percent of the project time. Project delays and cost overruns are frequently tied to underestimating the amount of time and resources necessary to complete data preparation or, even more frequently, to do the rework necessary when the project initially skimps on these activities and then data consistency, accuracy and quality issues arise.
More than ETL
Almost everyone – IT, vendors, consultants and industry analysts – associates data preparation with extract, transform and load (ETL) along with the tools in the marketplace that perform these processes. ETL tools focus on source (getting data from source systems) to target (putting data in data warehouses, operational data stores and data marts) mapping and associated transformations. Although the resulting ETL processes are part of the data preparation processes, they are only a subset. ETL tools are just that, tools to be used, and do not negate designing overall architecture and the processes to implement that architecture. People understand and include the gathering, reformatting, transforming, cleansing and storing data processes of data preparation, but they often fail to include or oversimplify the consolidation and data quality processes. You need to understand the data preparation processes, design an architecture to implement them and then use your ETL tool (and potentially other tools too) to deliver these processes to transform data into consistent, accurate and timely information that your business can use.
As we mentioned last month, data profiling is a critical process needed to understand the content of the source systems. Data profiling is a critical precursor to data preparation (i.e., you have to understand the data in order to prepare it for BI). In addition, data modeling is another process that complements data integration. When I was a software engineer, my first manager told me that there are two approaches to developing software: design and code or code and rework. The latter appears to be initially faster, but the former gets you there sooner with quality, consistent code. In addition, constant rework raises the overall cost and reduces the trust your user has in your code. “If we fail to plan, we plan to fail.”
ETL tools have been around for more than a decade, have grown significantly in functionality, and are offered by many vendors. When talking to anyone in the industry or reading articles, you would assume that everyone is using these tools on every DW project; however, Gartner states, “ETL tools are not commonplace in large enterprises. The majority of data warehousing implementations still rely on custom coding to deliver their ETL processes.”1
Despite the press releases and articles, why has custom coding dominated the DW landscape? Let’s divide the world into then and now. Many data warehouses were started years ago and are in their second or later generation. These data warehouses are already legacy applications with many of their tool decisions made based on the context of these tools when the DW was initially built. At that time, many of the ETL tools were very expensive, required the IT staff to learn yet another complex tool and, quite frankly, were limited in their functionality. Faced with these conditions, many DW project teams decided to create custom code to develop the ETL processes because it was quicker and cheaper (and they did not know how to justify the cost to their business). What many companies have experienced is that the total cost of ownership (TCO) that includes license fees, initial development costs and, most importantly, maintenance costs has shifted the argument to favor using ETL tools over custom coding; and this does not even account for many additional benefits of IT productivity, source code maintenance, impact analysis and meta data management.
Data preparation is not just about extracting and loading data. Ensuring data quality is often a critical activity that is required for the business to effectively use the data. This activity is always oversimplified and underestimated. It is often the activity that slows your project; or, if done inadequately, it is the activity that causes the most problems with the business users’ assessment of the usefulness of the data. As mentioned earlier, business requirements and expectations need to be developed with the business users to establish the critical success factors in this area. Most people consider data quality to be checking for error conditions, data format conversions and mappings, and conditional processing of business rules based on a combination of dimension values or ranges. These types of data quality processes can be adequately performed with an ETL tool if planned for and designed into the overall data-preparation architecture. However, a more important aspect of data quality is data consistency rather than simply error detection or data conversion. Data may be valid in its transactional form when brought into a data warehouse, but become inconsistent (and incorrect) when transformed into business information. Data preparation needs to take into account these transformations and ensure data quality and consistency. Tools alone will not enforce this consistency unless it is designed into the processes.
There are some instances of data quality that can be handled by tools. These areas are often categorized as data cleansing tools and manage a variety of complex data quality issues that many companies encounter. A significant example is customer names and addresses. A customer or supplier name (person or company) may be input in different variations such as IBM, IBM Corp. or International Business Machines Corporation. Many companies dealing with large customer lists, for example, have purchased specialized data cleansing software. Another example of data cleansing is the householding that financial institutions perform where they link household or family members’ personal and business accounts both for their customers’ convenience, as well as for their own convenience in promoting a full range of their services to their customers. There are other examples of industry-specific data cleansing tools that may be appropriate to use in addition to the ETL tools based on the business requirements relating to data quality.
Data cleansing tools are important tools in many companies’ arsenals in ensuring data quality and consistency. However, too often, many in our industry equate data cleansing tools with data quality. While the data cleansing tools provide data preparation processes for data quality, they are not the only processes that are needed in that area. Many see the data cleansing tool as a silver bullet. Others feel if they cannot afford those tools (or they are not applicable to their situation), then they can forget about data quality. You can never forget about data quality, and you have to build data quality processes throughout data preparation.
In future columns, we will explore various data integration and ETL issues such as: Should you hand code or use an ETL tool? Should you standardize on an ETL tool? Should you purchase a data cleansing tool? Should you purchase a pre-built solution for CPM? Surprisingly, conventional wisdom isn’t always the best solution for you. We will discuss the benefits, drawbacks and compromises involved with your choices. The goal is to separate fact from fiction and arm you with the knowledge to make an informed decision.
1. Gartner, “Slow Growth Ahead for ETL Tools Market,” November 7 2003, page 1.