Thursday, April 30, 2009

Data Warehousing is hard

It's hard. So very, very hard. In a perfect world, all the source systems would have universal keys, adhere to a global conceptual model, and have perfect data quality. HA! If only we lived in a perfect world.

We had a minor victory at work this week.

We have three (count 'em - three) sources of location data. This is store data. Three sources. One is considered fairly primary, another exists for reporting purposes, and a third exists in one of the three data warehouses that we have. The data warehouse I work on pulls data from two non-data warehouse sources. And ... of course, they're slightly different. So, we have a lot of code devoted to figuring out which one is right, and which one gets the higher preference, and understanding our location table (which, I guess in all fairness, is a fourth source), takes a fair bit of knowledge, since it contains data from two separate sources. It's kind of a mess.

Made all the more complicated by the fact that sometimes we receive transactions which aren't tied to any valid location at all. Some of our licensee locations have the ability to just put whatever location ID they want to in the transmission files. Sometimes those store managers want to report on separate counter agents, so they'll actually assign each person in the store a unique store ID!! Put that in your pipe and smoke it, data quality!

Crazy.

Anyway, back to the minor victory. We finally (we being the data management team in conjunction with certain clients) agreed that we could automate the process of bringing the primary source of location data into the data-warehouse. This isn't location data as much as it is hierarchy data. But, close enough. This is a big win because, get this - the primary source of this table currently is an email that gets sent out notifying downstream systems of a change in the location hierarchy, someone in production support has to create a ticket, and then updates the table manually. Whoa. What a mess.

As you can imagine, the table was more than a little different from the original source, but everyone involved agreed that the original source was "more better" so we decided to proceed with implementing the code that I wrote a couple of months ago. The code was tricky*, but I got it to work - but when it came time to implement it, I was like "Wow, this is a pretty big change. We may need to discuss this one first." Once everyone finally got on board, we can move forward with that code.

Minor victory. Every little bit helps.


* I had to read from a de-normalized Oracle table which was literally modeled after a COBOL flat file. So, I had to recurse through the table in order to build out the hierarchies. It was kind of ugly.

No comments:

Post a Comment