Wednesday, May 6, 2009

Updates bad, inserts good

Yesterday in a meeting, we had a fairly spirited debate regarding one of our legacy data warehouses. This legacy data warehouse attempts to do some rudimentary match/consolidation on customer data using PL/SQL code. It works OK, but not great. There are a handful of tickets currently open on this environment pointing out flaws in the way it has decided to match two apparently unrelated individuals.

That's a problem. Not a common one, mind you, but a problem nevertheless. We all agreed that our new data warehouse, which has a repository of customer data matched and consolidated via Informatica's new Data Quality (IDQ) toolset, is far superior. During testing, we were able to show that IDQ was even better than FirstLogic, the matching and consolidation tool we had previously purchased. Both are probably an order of magnitude better than anything pathetic little PL/SQL could ever hope to accomplish.

But one of the bigger issues discussed with regard to the legacy data warehouse was its use of updates during the match process. If the process found a match, it would update the customer record in place with the new information. If the match was flawed, too bad. The old data just got replaced. Bad idea.
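In SQL terms, the legacy approach boiled down to something like this - a hypothetical sketch, with made-up table and column names rather than our actual schema:

    -- A match was found, so the surviving customer row gets
    -- overwritten in place. The previous values are simply gone.
    -- (Hypothetical table/column names for illustration only.)
    UPDATE customer_dim
       SET first_name = :new_first_name,
           last_name  = :new_last_name,
           address    = :new_address
     WHERE customer_key = :matched_customer_key;
    -- If the match was wrong, there is no history to roll back to.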

This discussion came about because prior to this meeting, there was another meeting where it was decided that the match/consolidation should be removed entirely from this warehouse. That just means that instead of any attempt to consolidate, every transaction would point to its own unique instance of a customer, even if that customer already existed in the database 20 other times. This might sound inefficient, but considering how flawed the legacy matching process is, we decided that matching belongs in our new data warehouse, and anyone who wants a matched source of customers should come there. It also works out well because some business clients don't want matched customer data at all - Auditing, Subrogation, Accounting, etc. - they want the data exactly as it was at the time of the transaction. If the customer data changed from one transaction to the next, they're only interested in what the customer looked like at the time of each individual transaction.

This would be a win for them, since they would then have access to exactly the transactional view of the customer that they require.

It occurs to me now that there is a way to have your cake and eat it, too: Type II dimensions. A Type II dimension is fancy data warehouse speak for historical data in a table. Basically, you add a date/time stamp to your primary key, and every time you get a change to that particular row, you put a new date/time stamp on it and insert it as a new row. Ideally, you want to be able to relate your fact table (or whatever else relates to your customer dimension) down to that single instance of a customer, without sacrificing anything. You can also maintain a cross-reference table which correlates all the original instances of a customer record (assuming you're getting customers from multiple sources [otherwise, what's the point??]), so you're not losing anything.
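As a rough sketch of what that could look like - again, hypothetical table and column names, not anything from our actual warehouse:

    -- A Type II customer dimension, insert-only: the effective
    -- date/time is part of the primary key, and every change to a
    -- customer becomes a brand new row. Nothing is ever overwritten.
    -- (Hypothetical table/column names for illustration only.)
    CREATE TABLE customer_dim (
        customer_key   INTEGER      NOT NULL,
        effective_ts   TIMESTAMP    NOT NULL,
        first_name     VARCHAR(50),
        last_name      VARCHAR(50),
        address        VARCHAR(200),
        PRIMARY KEY (customer_key, effective_ts)
    );

    -- A change arrives: stamp it and insert it. The old row stays put.
    INSERT INTO customer_dim
        (customer_key, effective_ts, first_name, last_name, address)
    VALUES
        (:customer_key, CURRENT_TIMESTAMP,
         :first_name, :last_name, :address);

    -- The cross-reference table correlates each source system's
    -- original instance of a customer with the consolidated customer
    -- key, so the unmatched records are never lost.
    CREATE TABLE customer_xref (
        source_system      VARCHAR(30) NOT NULL,
        source_customer_id VARCHAR(50) NOT NULL,
        customer_key       INTEGER     NOT NULL,
        PRIMARY KEY (source_system, source_customer_id)
    );

The fact table just carries both the customer key and the effective date/time stamp, so every transaction points at the one version of the customer that was in effect when the transaction happened - and nothing ever needs to be updated.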

Updates bad. You should never update data in a data warehouse. Bad programmer. Bad.

No biscuit.
