Just like boot-cut jeans and fuel-efficient cars, big data warehouses are coming back in style. Take a trip down memory lane, when IBM introduced the corporate data warehouse (CDW) initiative in the early 1990s. The idea, which was admittedly very mainframe-centric at the time, was to put all your data for decision support onto one database. Naturally, the best platform for that database happened to be an IBM mainframe. Centralizing data, in what would years later be referred to as the “single version of the truth,” would enable businesspeople to gain an enterprise view of the business.
Being an early adopter always carries some risk, and this was no exception. CDW projects generally took longer than planned, ran over budget and underdelivered on expectations. And that was for the projects that were actually completed; many simply fizzled out. It is worth noting that many other big corporate projects in the 1990s met similar fates. There were scattered success stories, but, in general, the underwhelming results gave CDWs a bad reputation. It was inevitable, therefore, that there would be a backlash.
The next era was the rise of data marts. Data marts promised to be quicker and cheaper to build and provided many more benefits – including the benefit of actually being able to finish building them! Data marts can be an invaluable part of your data warehousing architecture; however, at that time they were primarily a backlash to the big, centralized CDW projects.
Rather than being part of an overall enterprise data architecture, data marts were the architecture. You could build them faster and cheaper because you did not take the enterprise into account. This meant you could take shortcuts and avoid the most difficult part of data warehousing – data integration. As a result, data marts were built for the parochial interests of their sponsors rather than for the enterprise.
Companies sprouted multiple data marts, each built only to accommodate its sponsor, with its own data definitions, data transformations, reporting and technical platform. We moved from trying to achieve one version of the truth to building multiple single versions of the truth – an oxymoron. To top it off, many of these projects overpromised and underdelivered, just like the CDW projects.
Data mart projects exacted another toll. The tradeoff for delivering cheaper and faster was to severely limit the breadth and depth of the data, its integration, its quality and the analytical capability offered to business users. Initially, data marts were perceived as a great success – until business groups realized they were debating in meetings which of their reports (and associated data marts) had the “right” numbers. In the end, data marts left their business users wanting more – which often meant building another round of data silos with more disjointed data marts or data shadow systems and more versions of the truth.
An answer to the data mart problem was needed. Two competing architectural approaches emerged: a virtual data hub of conformed data marts, and a hub and spoke with both a data warehouse and data marts. The conformed dimensions approach tied disparate data marts into a logical single version of the truth without the need for a “nasty” data warehouse. It was intellectually very appealing, but ultimately only a small percentage of companies had the discipline to implement it successfully.
The hub-and-spoke approach eventually became more popular, judging by the number of companies that adopted it. Successful hub-and-spoke implementations borrowed heavily from the conformed dimensions approach by using its data modeling and integration techniques. The key difference was that hub and spoke built a physical data warehouse (the data hub) rather than trying to achieve a virtual hub. That virtual hub was simple to draw but difficult to achieve in real-world situations.
Despite the adoption of the hub-and-spoke approach to unify data, however, the data silo genie was already out of the bottle. By the time companies adopted hub and spoke, they had already built countless operational data stores (ODSs), data warehouses, data marts, OLAP cubes and data shadow systems. They didn’t have the time or resources to bring these databases into the hub-and-spoke fold.
Now we’re into the corporate/enterprise data warehouse era. It began with a need to cut costs at the end of the Internet bubble. Companies were trying to do more with less. They started consolidating applications and databases along with their hardware platforms. This trend leveraged a lot of overcapacity that was bought during the bubble and made much better use of resources, including people. These initiatives were justified by reducing licensing, maintenance and upgrade costs both for software and hardware platforms. In addition, fewer people were needed to manage the consolidated platforms, further reducing costs.
Initially this consolidation era included a push to consolidate data marts onto big iron again – not necessarily mainframes, but enterprise-scale servers. This approach was pitched by the vendors selling the iron, and many of their customers did indeed consolidate data marts and save money. However, data mart consolidation also needed a data integration component. Just because you put data marts onto a single, physical database and server does not mean you integrated the data. Partially because it was expensive, this trend did not catch on with the masses.
The movement to a central data warehouse (DW) architecture emerged silently and steadily during this consolidation era. In a 2004 TDWI study, more than 60% of respondents stated that they were adopting a central DW architecture. In a similar survey just a year earlier, the majority responded that they had a hub-and-spoke architecture.
The big data warehouse is back and, along with it, the need for a well thought-out, enterprise-wide data architecture. In my next column we’ll explore whether this is a good or bad thing, some of the concerns and potential landmines, and the benefits for your business.