Religious wars. We have them in IT all the time. Everyone debates architecture. The “high priests” of our trade guard it. The approaches espoused by Ralph Kimball and Bill Inmon are at the center of one of the biggest data management religious battles.
Kimball’s approach defines the data warehouse (DW) as “The queryable source of data in the enterprise. The DW is nothing more than the union of all the constituent data marts.”1 Star schemas, conformed data marts and no need for those nasty data warehouses!
Inmon states that “The data warehouse is at the heart of information and analytical processing for the corporation…” but also that it “…is not the answer to all problems…”2 Inmon recommends building data marts fed solely from the DW: “There is only one legitimate feed into the data mart; that feed comes from the data warehouse.” Inmon disciples use normalized data, data warehouses, dependent data marts and that hub-and-spoke architecture. Who is right and who is wrong?
Only 16 percent of companies have implemented the Kimball approach with conformed data marts, versus 43 percent that use a DW with dependent data marts or a hub-and-spoke.3 If you add in the 18 percent response for central data warehouse-only implementations, the total following Inmon’s philosophy is more than 60 percent. This is also consistent with the Inmon architecture, especially if you dig deeper into its makeup.
However, merely looking at the physical characteristics, such as whether the DW or data marts are located on separate severs or databases, does not take into account its data architecture principles. In many ways, the Inmon and Kimball camps agree on quite a bit. They have similar approaches despite these two glaring differences:
Let’s look at the similarities. Both camps value data and feel it should be managed. Both camps believe the star schema is the best modeling technique for business intelligence (BI) enabled by data marts. The debate over normalized versus star schema is not about whether it should be used in data marts, but whether it should be used in the data warehouse – which Kimball says does not need to be built anyway.
Very importantly, both camps believe that the key to consistent data, regardless of how it is physically instantiated, is conforming dimensions. Your corporate numbers can only be consistent if your dimensions, such as products, customers, suppliers and employees, are conformed.
The problem of inconsistent dimensions has also appeared with ERP system offerings. Companies that have implemented enterprise resource planning (ERP) systems are faced with inconsistently defined dimensions, such as product and customers, across ERP application modules or instances. The ERP vendors, who are also offering their own data warehousing packages, are addressing the inconsistency problems by conforming the dimensions within their systems to enable consistent operational reporting. Conforming dimensions is being referred to in the ERP systems as master reference data management. However, regardless of its name, conforming dimensions is a best practice for both business intelligence and operational reporting.
A more subtle, but maybe surprising similarity is that hub-and-spoke is the best form of data integration. Don’t let the difference between a persistent DW versus a transient data staging area distract you from the fact that both camps recommend bringing the data to one construction site (DW versus data staging), which is a hub, and then distributing the data to data marts that are the spokes. Both sides recommend common ETL (extract, transform and load) processes to enable this hub-and-spoke architecture.
Why hub-and-spoke? When you examine ETL processing, that approach creates one-to-many (DW to data marts) versus many-to-many interfaces if you had a point-to-point approach from source systems to data marts. The one-to-many is initially simpler to implement (except that you have to design and plan up front) and generally more cost-effective in the long run. And, it’s more likely to ensure consistent (and conformed) dimensions. It is not surprising that companies using an ETL tool rather than hand coding overwhelmingly prefer the one-to-many architecture because it is more cost-effective. Hand coding tends to create pockets of hidden interface code. Because they encourage sharing and increase visibility, ETL tools reduce this problem.
The big disagreement is whether you need to build a data warehouse at all. There are horror stories of data warehouses costing too much, taking too much time to complete, having too few users, having a lot of data but little information, becoming monolith entities, and the IT group supporting them being unresponsive to business needs because they are so busy with data warehouse care and feeding. Data warehouses are sometimes called “data jails” – or worse.
However, is the problem the data warehouse or how it was built? Too often DW projects become big-bang projects that overwhelm overworked, under-funded IT departments. Not building a data warehouse may be appealing, but Kimball’s alternative isn’t necessarily a piece of cake. You’d still have to implement a hub-and-spoke, conforming the dimensions and feeding the dependent data marts via a data staging area. That’s the hard work. Storing the data in a persistent data warehouse is not that tough after that work is done.
From a pragmatic perspective, why not store the data in a data warehouse? Without a data warehouse, the data marts have to store all the detailed data they will ever need or it gets lost. Why not have dimensions and facts in one place? Why not have historical data available as a corporate asset? If the data staging area is not persistent, then how do you handle changes in organization and product structures (dimensions) that happen often enough in corporations today?
Regardless of what your philosophy is regarding data architecture, the reality is that most large corporations have many data warehouses or independent data marts. According to a survey by TDWI in Winter 2004, the average organization has two data warehouses, six independent data marts, 4.5 operational data stores (ODSs) and 28.5 spreadmarts (data shadow systems).
Why so many? There are historical reasons, such as mergers and acquisitions, ERP and DW groups building their own silos and, quite simply, multiple groups within a corporation doing their own thing. And now, even with these multiple silos, ERP and BI vendors are selling their own corporate performance management (CPM) solutions (with the accompanying DWs and data marts) that are quite separate from the existing corporate silos. There is often ample business justification for implementing the CPM solutions, but there are long-term data-integration consequences.
Because almost no one has a green field environment, what should they do? Work toward conforming the dimensions and implementing a hub-and-spoke data architecture, especially with regard to common ETL processing. You may not be able to consolidate data warehouses, but you should, at a minimum, work toward implementing Kimball’s data bus.
The best practices that you should embrace are using a hub-and-spoke data architecture with a reference data backbone (or conformed dimensions). With these practices, you can physically locate your data warehouse (hub) and data marts (spokes) centrally or distribute them on your network depending on your technology and organizational preferences. More importantly, do not overlook the fact that many data shadow systems — OLAP cubes, Microsoft Access databases and spreadsheet farms — are littered across your corporate landscape. The easier you make it to interface with your hub-and-spoke architecture and reference data backbone, the less likely it is these will spread out of control and the more likely it is that the reports generated by these systems will be consistent with your corporate data systems.
Now the next time someone discusses Inmon versus Kimball, you can say you agree with both.4
References: