Essential Steps in the Data Integration Process

Published in Information Management

The business users are restless. They’re looking to you for a solution that’s long overdue. It’s tempting to rush out and buy technology to solve your problems, but resist the urge until you have determined what you need to do. Then figure out how.

People usually oversimplify data integration by assuming it involves only extract, transform and load (ETL) tools. Though critical, an ETL tool is just one piece of a complex puzzle. The data integration framework (DIF) encompasses two categories of processes. The first category is the process to determine your data requirements and solution. The second is the process used to physically gather the data from its sources and transform it into information that businesspeople can use to analyze and make decisions.

Determining your data requirements includes:

  • Gathering business requirements
  • Determining data and quality needs
  • Data profiling: understanding the data sources and their quality, both within each source system and across multiple source systems, if applicable
  • Performing a data quality assessment against the metrics the business has requested
  • Defining the gap between what data is available and its quality versus what the business has requested
  • Revising business expectations or project costs and determining the selected data solution
  • Modeling the necessary data stores (staging areas, data warehouse, operational data store and data marts) from both a logical perspective (to confirm business requirements) and a physical perspective (to enable implementation).

At the end, you’ll know what data to source and store.
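
The quality-assessment and gap-analysis steps above can be sketched in a few lines of code. The following is a minimal, hypothetical illustration (the field names, records and thresholds are all invented for the example): measured completeness is compared against the targets the business requested, and any shortfall becomes the documented "gap."

```python
# Sketch: compare measured data quality against business-requested metrics.
# All field names, records and thresholds below are hypothetical.

records = [
    {"customer_id": 1, "email": "a@example.com", "region": "EMEA"},
    {"customer_id": 2, "email": None,            "region": "AMER"},
    {"customer_id": 3, "email": "c@example.com", "region": None},
]

# Completeness targets the business has requested, per field.
requested = {"customer_id": 1.00, "email": 0.90, "region": 0.95}

def completeness(rows, field):
    """Fraction of rows with a non-null value for the field."""
    return sum(1 for r in rows if r.get(field) is not None) / len(rows)

# The "gap" between available quality and requested quality.
gap = {}
for field, target in requested.items():
    actual = completeness(records, field)
    if actual < target:
        gap[field] = {"requested": target, "actual": round(actual, 2)}

print(gap)
```

In practice the output of a script like this feeds the expectation-revision step: either the business relaxes its targets or the project budgets for remediation.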

Regularly updated documentation is critical during this process. The integration process, both from an information technology (IT) and business perspective, must be documented, verified and made available to everyone building and using the systems. In order to share this information with business users, it is highly recommended that you have a writer “translate” the IT jargon and system-specific transformations into something business users can understand.

Data profiling is a time-consuming and often underestimated task in data integration. Considering that many businesses keep much of their operational data in a standard enterprise resource planning (ERP), customer relationship management (CRM) or supply chain management (SCM) package, the definition and quality of the source data should be readily available from the people who designed and deployed these systems. If the package has been heavily customized or the development team has scattered, data discovery may not be so easy. Likewise, if the source data lives in custom-built legacy systems or in a vendor’s application package that was implemented years ago, data source discovery will be difficult.
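
At its simplest, profiling means counting what is actually in each source column. A minimal sketch, using invented source rows, might collect row counts, null counts, distinct values and the most frequent values per column, which is usually enough to surface duplicates and inconsistent coding:

```python
# Sketch: minimal data profiling of a source extract (hypothetical rows).
from collections import Counter

source_rows = [
    {"sku": "A-100", "price": "19.99"},
    {"sku": "A-100", "price": "19.99"},   # duplicate row
    {"sku": "B-200", "price": None},      # missing price
    {"sku": "b-200", "price": "5,00"},    # case and locale inconsistency
]

def profile(rows, column):
    """Basic statistics for one column of a source extract."""
    values = [r[column] for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_values": Counter(non_null).most_common(2),
    }

print(profile(source_rows, "sku"))
print(profile(source_rows, "price"))
```

Even this toy profile exposes the kinds of issues the article describes: a lowercase duplicate SKU and a locale-dependent price format that no single source-system user would ever notice.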

Once the data requirements are “finalized” (things change, which is why documentation and revisions are crucial), we can move to the second category of processes. The steps to physically gather the data from its sources and transform it into information are: data preparation, data franchising, meta data management and data management.

Data preparation, the most complex of these processes, includes gathering, reformatting, consolidating, transforming, cleansing and storing data both in staging areas and the data warehouse. It’s best to incorporate the data quality or cleansing activities into the initial process workflow. Many people assume either that the quality of their data is fine or that data quality is the transactional system’s problem. The reality is that there are many inconsistencies between source systems that are not apparent to the people who use the individual systems. It is only when data is consolidated that quality comes into question. In addition, business analytics may require dimensions, measures or metrics that are not critical when the data is created (e.g., product category). It is important to define how these situations will be handled during data preparation. The end result of this stage is a data warehouse where information is stored and managed for “downstream” use by businesspeople.
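
The point about inconsistencies only surfacing at consolidation time can be shown with a small sketch. The two source systems, field names and standardization rules below are hypothetical; the idea is that each source looks internally consistent, and only the cleansing step lets the two rows resolve to one customer:

```python
# Sketch: a simplified data-preparation step -- gather rows from two
# hypothetical source systems, cleanse them, and consolidate for the warehouse.

crm_rows = [{"cust": " Acme Corp ", "country": "us"}]
erp_rows = [{"cust": "ACME CORP",   "country": "USA"}]

COUNTRY_MAP = {"us": "US", "usa": "US"}   # hypothetical standardization rule

def cleanse(row):
    return {
        "cust": row["cust"].strip().title(),             # normalize names
        "country": COUNTRY_MAP[row["country"].lower()],  # standardize codes
    }

# Consolidate: after cleansing, both sources resolve to the same customer.
warehouse = {(r["cust"], r["country"]) for r in map(cleanse, crm_rows + erp_rows)}
print(warehouse)
```

Neither source system's users would ever see a problem with their own row; the conflict exists only in the consolidated view, which is exactly why cleansing belongs in the preparation workflow rather than being deferred to the transactional systems.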

Data franchising is the process of reconstructing data into information that is usable by businesspeople for reporting and analysis. The data from the data warehouse may be filtered, aggregated, summarized, staged and stored in a data mart where it will be accessed by the businesspeople using business intelligence (BI) tools. These processes may be referred to by the IT staff as denormalization. It simply means the data has been reconstructed to allow the BI tools to better handle and manipulate the data.
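
A data-franchising step can be sketched as a simple filter-and-aggregate pass from warehouse detail into a flat, denormalized mart table. The table and figures below are invented for illustration:

```python
# Sketch: data franchising -- summarize warehouse detail rows into a
# denormalized data mart table (all names and figures hypothetical).
from collections import defaultdict

warehouse_sales = [
    {"region": "EMEA", "product": "X", "amount": 100},
    {"region": "EMEA", "product": "Y", "amount": 50},
    {"region": "AMER", "product": "X", "amount": 70},
]

# Aggregate and summarize by region for BI-tool consumption.
mart = defaultdict(int)
for row in warehouse_sales:
    mart[row["region"]] += row["amount"]

print(dict(mart))  # {'EMEA': 150, 'AMER': 70}
```

The mart row set trades normalization for query convenience, which is the reconstruction the article describes: the BI tool reads one flat summary rather than joining detail tables.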

Data and meta data management are processes behind the scenes that pass the data and its definitions (meta data) between the other processes. Data and meta data management are often considered by both the IT staff and vendors to be weak cousins to the data preparation and data franchising processes.
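
One way to keep meta data from becoming a weak cousin is to make it travel with the data itself. The structure below is a deliberately minimal, hypothetical illustration (the field definition, lineage entry and table name are invented), showing the data and its business definitions passed between processes as one unit:

```python
# Sketch: meta data travelling alongside the data between processes.
# The definitions, lineage and table name below are hypothetical.

dataset = {
    "data": [{"revenue": 1200}],
    "meta": {
        "revenue": {
            "definition": "Gross invoiced revenue, before returns",
            "source": "ERP table GL_POSTINGS",   # hypothetical lineage entry
            "unit": "USD",
        }
    },
}

def describe(ds, column):
    """Answer the business user's question: what does this field mean?"""
    m = ds["meta"][column]
    return f"{column}: {m['definition']} ({m['unit']}, from {m['source']})"

print(describe(dataset, "revenue"))
```

When meta data is carried this way, the "translation" of IT jargon recommended earlier has a single authoritative place to live, instead of being reconstructed after the fact.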

The more attention paid to these processes, the more robust, maintainable and understandable the data will be to both IT and businesspeople.
