Back to the Basics of Data Warehousing

Is data integration the Achilles heel of Software as a Service?
June 11, 2007
“Tuck-in” or “Sent-out” Acquisitions: IBM purchases Data Mirror
July 17, 2007
Show all

Back to the Basics of Data Warehousing

published in Information ManagementBusiness demand for information is insatiable, driven by performance management, competitive pressures, industry regulations and the exchange of data with suppliers, partners and customers. Likewise, data integration becomes a challenge as companies deal with diverse sources, huge volumes of data and demand for real-time results.

The good news is that data integration capabilities have expanded over the years through software improvements and infrastructure advancements. In software, extract, transform and load (ETL) has emerged as the data integration workhouse with enterprise application integration (EAI), enterprise information integration (EII) and service-oriented architecture (SOA) integrated into powerful data integration suites. Infrastructure advances in multiprocessor central processing units (CPUs), storage arrays, disk input/output (I/O), network bandwidth and database throughput have greatly increased the volume of data that businesses can process. The bad news is that despite these advances, enterprises have not been able to keep up with business information demands. And, many simply cannot afford it.

There are two fundamental traps enterprises fall into that constrain data integration efforts regardless of how much they have to invest. These traps are the “silver bullet” syndrome and the “if it’s old, it’s no good” syndrome.

The Silver Bullet

In the dawning age of data warehousing, ETL tools were simply code generators. Their high cost and low functionality limited their use. IT initially custom coded all data integration applications. The best data integration coders had an expert knowledge of database development, tuning and optimization. Databases were years away from the self-tuning and optimization that people take for granted today.

Now, ETL and database optimization are more advanced, but as a result, IT’s in-depth understanding of data integration and database internals have declined. Most of the people involved in data integration today do not have the same understanding of data integration and databases – and with today’s fancy tools, they don’t need it. So, when the business needs more information IT looks for a silver bullet: buy more complex data integration software and infrastructure. However, not every enterprise has the budget, and even those that have money to throw at the problem find that either they can’t keep up with demand or the new tools don’t meet their data integration needs.

If It’s Old, It’s No Good

There are two fundamental principles for designing a data architecture and maximizing data throughput:

1. Process the minimum amount of data that is required to keep data up to date.
2. Load the data as quickly as possible into the database you are using for data

Despite all the advances that have occurred during the last two decades in data integration technology, databases and infrastructure, these two principles still apply. But somewhere along the way people have forgotten them – or perhaps, they never learned them in the first place. They rely on their data integration suites and databases to load data quickly, and when they run into bottlenecks, they buy faster CPUs, more memory and speedier disks. But all they really have to do is follow these two principles – it’s a lot less expensive.

People try to make up for forgetting the fundamentals by making larger software and hardware investments, but they cannot meet the demands of an avalanche of business information.

Back to Basics

The best way to speed data throughput is to gather only the minimum data needed to update your data warehouse or operational data store (ODS). It is a best practice to implement this through change data capture (CDC) logic, but many data warehouses and ODSs are built using complete data reloads. Some of this processing is a holdover from data warehouses and ODSs built years ago. These data warehouses have become legacy applications handicapped by their complete reloads, and IT has been reluctant to rework these data warehouses using CDC.

Many companies aren’t just building their legacy data warehouses from scratch every load cycle, they also have data marts and cubes they rebuild every time they use them. It’s time to consider breaking the cycle and modernizing your data warehouse/business intelligence (BI) load cycle.

Bulk loading – the arcane and unglamorous database loading method – is another old approach to break the data integration bottleneck that helps avoid buying new software and infrastructure. The principle is to get the data out of your source systems and into your data warehouse environment as quickly as possible. Generally, this approach is a very quick and inexpensive method to significantly improve loading the data warehouse. Bulk loading only needs to be applied to your top bottlenecks, which are generally fact tables (in dimensional modeling) are we typically only about 10 percent of the tables/files that you are loading. It is interesting to note that even the high-end data integration suites have made accommodations to bulk loaders, further reinforcing that it is indeed a viable and valuable tool.

There are other approaches, methods or techniques that can also be applied from the “dark days” when data warehousing was emerging. The laws of database and data integration still apply.

Leave a Reply

Your email address will not be published. Required fields are marked *