Five Great Ways to Really Mess Up Your Data Warehouse During Loading

Book excerpt: Why a BI environment needs a solid architecture framework
September 24, 2015
The Four Legs of a Successful Business Intelligence (BI) Project Team
January 3, 2016
Show all

Five Great Ways to Really Mess Up Your Data Warehouse During Loading

blog_grenadeYour data warehouse is only as good as the data you load into it. But the data-loading process can be tricky, and there are ample opportunities to really mess it up if you don’t have the right experience. Here are a few that spring to mind:

1. Blindly change the ETL code

Go ahead, play around with the ETL code when something doesn’t look right, even if you don’t understand the business logic behind it.  It will start loading the data incorrectly, resulting in bad data. Won’t that be fun?

2. Testing is for wimps

Don’t test the data after changing the ETL code. If you do, you might see that it’s wrong. That might ruin your day.

3. When in doubt, add an index

Suppose your DW is being updated very quickly because many of its tables are loaded using parallel processes. And also suppose you’re using a data discovery tool that loads data in memory, allowing business people to analyze data in memory.

It’s pretty fast. But you want it faster. So how about adding indexes to the table? Don’t indexes speed up queries? Well, not really because the data discovery tool is performing the analysis in memory and does not use the indexes.  But you know what does? The parallel update processes does and it creates deadlocked processes.

This isn’t your father’s data warehouse, where a BI tool would query a relational database and indexes would actually improve query speed. They have the opposite effect and knock it dead.

4. Keep adding indexes

Some ETL tools use memory to process data for things like joining, filtering and look-ups. This makes processing faster and alleviates the need for indexing on relational database sources, just like data discovery tools.

But if you want to really slow things down, add some indexes anyway. They won’t be used in querying the data, but it will slow down processing because they have to be updated as the data is loaded.

5. Only nerds understand slowly changing dimensions 

You don’t want to understand slowly changing dimensions (SCD). Because if you did, you might miss an opportunity to break your data warehouse.

Some ETL tools have SCD transformations, which are great at helping to standardize processing. But if you can avoid learning about SCDs, you can improperly implement the SCD processes so your data gets corrupted.

This is great because then the warehouse isn’t tracking changes in dimensions, and when your customers’ addresses change or the prices of your company’s products change that data won’t get captured. Then everyone in the business group gets mad – and you can watch, knowing that you caused it.

3d-cover_75px(One way to keep from being the person who messed up the data warehouse is to read my Business Intelligence Guidebook – From Data Integration to Analytics. Chapter 10 will tell you all you need to know about slowly-changing dimensions!)

Leave a Reply

Your email address will not be published. Required fields are marked *