As we discussed in my previous column, big data warehouses are back in style. In a 2004 TDWI study, 60 percent of respondents indicated that they were adopting a centralized data warehousing architecture, i.e., using a single enterprise data warehouse (EDW) with no associated data marts or operational data store (ODS) systems. We have come full circle to the early 1990s, when this approach led the initial data warehousing wave.
In addition to the motivation of consolidation and cost cutting, many people felt that an EDW would be the easiest path to creating the elusive “single version of the truth.” After all, it lets you avoid creating those pesky data marts, not to mention managing the associated data integration to ensure that all their reports and analysis are consistent. Many see the data silos – multiple data warehouses, data marts, ODSs and cubes – created during the Internet boom and cry, “No more!” They think it’s overcomplicated and want to put everything into one data warehouse that is integrated and consistent. Because a centralized, standalone EDW is an easy architecture to describe and understand, it seems like the clear path to take. But is it really that simple?
Will EDWs become a best practice, and is that a good idea? Will the early successes of those adopting an EDW lead others toward this architecture?
An EDW was the first generation of data warehousing. It was simple and everyone could understand it. But that generation faced implementation and scope overload. Many projects failed outright or didn’t live up to expectations. People today think they can do better. They think that with today’s leading-edge thinking and technology they won’t have the same problems. They’ve got databases and extract, transform and load (ETL) tools that handle huge volumes of data; real-time updates and access are now available from enterprise application integration (EAI) and enterprise information integration (EII) tools; business intelligence (BI) tools can access data from anywhere, anytime and any way they want. In addition, people have the experience gained from multiple generations of data warehouses that the first generation did not have.
However, just having the technology and experience now doesn’t dictate that an EDW is a good idea.
Simplicity in a design is terrific, if it is the right approach. An EDW eliminates the complexity and expense of creating data marts with a hub-and-spoke architecture, but it is only the right choice if it meets your business users’ needs.
Why did we have data warehouses, data marts, operational data stores and cubes – was it because we could? Was it because we couldn’t agree on the EDW so everyone built their own? There is a little bit of truth to the first statement and, unfortunately, more truth than we care to admit for the latter. However, there is a method to the madness.
Data warehouses are built to integrate data from multiple source systems, cleanse the data, ensure consistency across the multiple sources and store historical data. The DW enables the data distribution that establishes an enterprise’s information backbone. Data marts are built to subset, aggregate and perform transformations for their target information consumers (business users). A data mart’s typical audience is a business function (such as finance) or a set of business processes (such as supply chain management). Data marts are really data franchises pulling data from your data warehouse and configuring, packaging and distributing it for the business community. OLAP cubes are also data franchises derived from data warehouses or data marts. These cubes can be considered mini-data marts.
An EDW is quicker and cheaper to build than a DW with associated data marts (hub-and-spoke architecture) at first blush. A more realistic cost comparison needs to take a holistic enterprise view that examines the total cost of ownership (TCO) of providing information for the business users to create their reports and perform their business analysis. The holistic view needs to account for the flow of data from its point of creation in the enterprise until it is consumed by the business users in their reports or analysis. Every time the data is accessed, extracted and transformed, it costs the enterprise time, resources and money. More importantly, it has the potential to reduce business productivity and lessen their ability to manage the business, because users may be spending time gathering and transforming data rather than analyzing the information and managing the business.
When you develop an EDW, you are building a database to perform data collection, cleansing and integration as well as data franchising and reporting. The same database is used as the vehicle to source, cleanse, transform, integrate, aggregate and provide access to the information that will eventually feed reports and business analysis. Databases have advanced significantly from the first generation of data warehouses, but have our design skills? When you design something to do all things, it may not do them all well. In fact, sometimes it does not do anything well. Let’s look at what results from an all-in-one EDW.
The data franchising approach – building data marts and cubes from a data warehouse – initially takes more time to build. However, this architecture is a very efficient method to prepare the data for business reports and the analytics that business users perform. Without this data franchising, each report and business analysis requires repeatedly performing filtering, aggregation, denormalization and data transformation that could have been done once in their data mart. Regardless of whether the IT group builds them or not, the data franchising processes must occur. The question is: where do you want them to occur? What is the most effective and efficient approach?
If you don’t have data marts, data franchising needs to take place within the BI reports generated by the EDW. It means that the reports are more complex to design, build and maintain. Queries are slower. Analysis is more difficult because business users need to understand much more complex data structures. Data marts, on the other hand, are prebuilt for better understanding and easier access, and they are customized for their customers.
Many companies have shifted the burden of data franchising onto their BI tools and, ultimately, their business users by adopting an EDW. This has two significant adverse effects. First, the BI environment takes more time to build, operate and maintain. You shift the cost of building a data mart once to building the same logic many times into a BI report or business user’s analysis. From a total cost of ownership perspective, this shift increases a company’s cost of providing the data warehousing environment. Additionally, it reduces the productivity of the business as it tries to analyze the data.
The second and more costly result of an EDW architecture is shifting data franchising from creating data marts to data shadow systems. Data shadow systems are built by business users to gather, integrate and transform the data needed to create their reports and perform their analytical activities. The business will use the tools that they are familiar with – Microsoft Excel, Microsoft Access and sometimes statistical packages such as SAS. Data shadow systems encompass large numbers of spreadsheets and business user-built databases, often in Microsoft Access, created to accomplish the data franchising that data marts could or should have provided. In the absence of an architecture that explicitly embraces data franchising, data shadow systems will flourish, repeatedly redoing the data processing that a data mart would have done once. Because these data shadow systems are generally neither documented nor designed in any architecture, the likelihood of incorrect information being created is high. In addition, these systems exacerbate the spread of data silos throughout an enterprise.
You are damned if you do and damned if you don’t. If you build data franchising via data marts, it costs you time, money and resources. The architecture is more complex, and you are responsible for any data integration issues.
If you don’t build the data marts and enable data franchising, you only need to build the EDW. However, there will likely be databases created by IT groups and data shadow systems built by business groups to create the missing data franchising functionality. You avoid the upfront cost of building the data franchising processing, but you end up with it being built later. The problem with data franchising being tacked on rather than architected from the beginning is that you end up with overlapping data silos. The data silos are created with redundant filtering, aggregations and data transformations. This approach is costly to maintain and operate, and it moves the enterprise farther from the goal of a single version of the truth.
The EDW may be simpler and less costly in the short run, and you don’t own the data integration issues once the data leaves the data warehouse. Still, the overall enterprise cost will be higher, and you are less likely to achieve a consistent, integrated set of information for the business users once you factor in report complexity and the hidden cost of data shadow systems.