We've had a couple of interesting discussions this week regarding data quality. One of the major areas we've been dealing with lately is email addresses. We had a problem last Friday pulling a list of email addresses for a marketing campaign: a couple of the records were obviously bogus (something like notvalidemail@acme-hackme.com), and as an added bonus, whoever created one of them on the website added a carriage return to it. So not only was the email bogus, it ended up crashing my job because it threw the line count off.
I suggested to the web team that they filter this kind of stuff out. It turns out they do, but this particular record was loaded as part of a mass import of data from the third-party vendor that used to manage that stuff.
In investigating these records, it occurred to me that there's a whole level of data quality "corrections" that could make the data better. For example, not only do we want to exclude the bad addresses, we also want to encourage users to correct the nearly valid ones - because if you're specifically signing up for an email alert, it doesn't do you any good if the email address isn't valid at all.
Some of the examples I discovered in our database were things like joe2yahoo.com, or joe@yahoo.co, - where they fingered the comma instead of the m - or even cp, where their hand was off just slightly (cim and cin were in there too). The joe2yahoo example happened because someone didn't hold the shift key down and got a 2 instead of an @.
I suggested that these patterns could be matched and then replaced, fixing the email address, but the web developers were all very hesitant. "What if joe@yahoo.com isn't even their real email address?", they asked. "You'd be turning an intentionally bogus email address into a valid one, and then we'd be spamming poor Joe, who never even asked for our emails."
In fairness, I suspect if joe@yahoo.com is a valid email address, he probably gets a ton of spam anyway.
But their argument had some merit. So yes, it would seem that data cleansing can sometimes have unforeseen side effects.
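If I ever do get buy-in, I'd start by just flagging the suspects rather than fixing them - that sidesteps the web developers' objection entirely. A minimal sketch of an expression-port flag, assuming a hypothetical input port called EMAIL_ADDR (the port name and the regex are mine, not from our actual mapping):
-- 'SUSPECT' if the address is null or doesn't look like name@domain.tld
IIF( ISNULL(EMAIL_ADDR)
     OR NOT REG_MATCH(LTRIM(RTRIM(EMAIL_ADDR)), '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}'),
     'SUSPECT', 'OK' )
Route the 'SUSPECT' rows to a review table instead of the campaign list, and nobody accidentally turns a bogus address into poor Joe's real one.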
Friday, May 29, 2009
Friday, May 22, 2009
I'm ready to kill someone
Apparently,
NULL != a number
Doesn't evaluate to anything in Informatica.
So, if you have two ports, "existing_x" and "new_x", and you're trying to route on whether they're different, and your router condition is something like this:
"existing_x != new_x" - the condition isn't satisfied when one of the values is NULL and the other one isn't.
What a pile of garbage.
I had to force the NULL value to be zero before it would work.
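For the record, the forcing-to-zero workaround can go right in the router condition. A sketch (it assumes zero is never a legitimate value for these ports - if it is, pick some other sentinel that can't occur in your data):
IIF(ISNULL(existing_x), 0, existing_x) != IIF(ISNULL(new_x), 0, new_x)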
I'm also ready to kill someone. Whoever wrote the code to handle that would be a good candidate.
Beware the bit
This is probably unique to SQL Server, so if you don't use SQL Server, you can probably ignore this tip.
Beware the bit data type. It's a curious thing. This just bit me in the butt, so I thought I'd share it.
If you pull in a bit datatype from a source, Informatica converts it to a string (1) in the qualifier. Odd, considering a bit is either 0 or 1.
If you try to write that to an int target, you'll always get 0 - why? Because the string(1) doesn't store "0" and "1", it stores "T" and "F".
When I get some free time (ha!) one of these days, I'm going to change the qualifier to a string(10) just to see if it actually stores "TRUE" and "FALSE", but I suspect it doesn't, since it defaults to just a string(1).
But, if you're going to write a bit to your target, you'll need to use an expression, something like:
IIF(bit_Flag = 'T', 1, 0)
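And if the source column is nullable, a slightly more defensive variation (my own tweak, not from the original mapping - defaulting NULL to 0 here is an assumption, so pick whatever your target actually expects):
IIF(ISNULL(bit_Flag), 0, IIF(bit_Flag = 'T', 1, 0))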
Consider yourself warned. :)
Tuesday, May 19, 2009
Binary log files
I'm not a fan of version 8's binary log files. They're kind of hard to work with, especially from Windows Explorer, since I haven't yet found a command-line tool I can associate with that file type to open them (if you know of one, I'd love to hear what it is!).
But Informatica isn't completely unforgiving in this regard; there's an option you can set if you're kind of a luddite like me and prefer the standard text files. Just choose "Write Backwards Compatible Log File" in the session's properties.
Be aware that this will actually create a PAIR of log files for your job - a .bin log file and a regular ASCII log file. Curiously enough, the .bin log file will be twice as large as your regular ASCII log file (so much for progress). Workflow Monitor will still read the .bin file - so everything will still look all fancy compared to previous Informatica versions - but you'll also have the regular file in case you need to do some historical investigation.
If you want to read old .bin files, you can use the Import function within Workflow Monitor (go to "Tools" and then "Import Log"), but I find it to be kind of a nuisance.
Monday, May 18, 2009
2 sessions, 1 mapping
One of the neat things about Informatica is its ability to re-use code. You can take advantage of this by writing generic mappings that you intend to run in different ways, perhaps with a slight variation on the SQL Override, for example.
We have a couple of mappings which we are doing this with. Our customer match/consolidation sessions are replicated into 8 pieces, each doing a subset of the alphabet. We use a parameter file in the workflow to tell each session which letter to process. This way, we only have to maintain one version of the mapping, and it gets called 8 different ways.
The parameter file we use has entries for each session which look something like this:
[s_work_CustMasterMatched_AtoB]
$$STREAM=AtoB
$$STREAM_BEGIN=A
$$STREAM_END=Bzzzzz
$PMSessionLogFile=s_work_CustMasterMatched_AtoB.log
This is handy, because if the volume of data is too large to process (and we run out of memory trying to process it), I can cut it in half by processing only A to Azzzzz. And I don't have to change the code.
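For context, the mapping's source qualifier override is what actually consumes those parameters. A sketch of what that looks like - the table and column names here are made up, not our real ones:
SELECT cust_id, last_name, first_name
FROM   cust_master
WHERE  UPPER(last_name) BETWEEN '$$STREAM_BEGIN' AND '$$STREAM_END'
Each of the 8 sessions points at the same mapping; only the parameter file entry changes.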
In your session properties, be sure to set your log file name to $PMSessionLogFile - anything other than this, and you're not going to get the name of your logfile right. Informatica looks for this variable name specifically. You can't even be clever and use something like $logfile here. It just won't work - you'll get a log named $logfile.bin - and all sessions sharing that name will write to the same log. No good.
Also remember that if you modify the SQL override at the workflow level, any changes you make to the mapping won't be reflected in the workflow (since workflow changes override session changes, which override mapping changes). So be diligent in how you make updates to these kinds of jobs.
Wednesday, May 13, 2009
Problem solved
Two blog entries in one day. I must be trying to make up for yesterday.
I just spent a couple of hours trying to solve this problem, and I think I finally did.
I had a session with a $$LASTRUN mapping variable, which should be set to the SESSSTARTTIME of the run when the session completes, so the next run knows where to pick up from the previous one.
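For anyone who hasn't used this pattern, it boils down to two pieces. A sketch with hypothetical names (my column is last_update_ts; yours will differ):
-- 1) Source qualifier filter: only read rows changed since the last successful run
last_update_ts > '$$LASTRUN'
-- 2) A port in an expression transformation: push the variable forward for next time
SETMAXVARIABLE($$LASTRUN, SESSSTARTTIME)
(Or SETVARIABLE, depending on the variable's aggregation type.) When the session succeeds, the repository saves the new $$LASTRUN value for the next run.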
Except it wasn't working. When I ran the session, the mapping variable ended up with the exact same value it had started with. Which frustrated me, because I have other jobs that do the exact same thing, and they do work. So, I ended up comparing this mapping (which wasn't working) with one that was, and found this:
The mapping which worked had a default mapping variable value of '2009-04-17 12:00:00'.
And the default mapping variable value of the one not working was '04/17/2009 12:00:00'.
Note the ever-so-subtle difference.
Now, why this caused the mapping variable not to get assigned the correct value, I'll never know.
But keep this tip in your back pocket in case you run into problems with mappings behaving in this way.
Lessons learned
I've learned a few tricks in Informatica that make things easier and simpler. I'm a fan of simpler.
In a lookup, you can uncheck the ports that you don't need. This makes the lookup faster, since it's not pulling back columns that aren't needed from the database. But never, ever delete these columns from the lookup transformation. Just uncheck them. Deleting them doesn't break it, but you never know when you will want to add a column back to it at some future date, and then you'll have to recreate the lookup from scratch.
There's also no reason to send every column into an expression if you're only going to modify one column. How many times have you seen a mapping with a source, a source qualifier, every column moved into an expression where one column is transformed slightly, and then every column written to a target - or some variation of that? I have. It's a waste. Route the columns you aren't modifying directly to the target.
I also find it very helpful to rename my targets. There's nothing more useless than having 9 targets all called target, target1, target11, target111, etc. If you're trying to figure out whether updates are working for a certain scenario, you can't unless the target is actually named for its purpose. I like being descriptive. Table_name_insert and table_name_update are popular. If there are more than those, I try to add what kind of insert or update it is. That makes checking the run properties a lot easier.
Don't forget to set your schema names for tables in lookups and targets, and define the cache folder for lookups. This is especially true for lookups in mapplets which get re-used. None of the original developers defined the cache folder or the schema name for any of the tables in the mapplet lookups. For example, all of our tables are in the IDS schema in our database, which means I have to qualify them all, since SQL Server defaults to dbo. And I have to do it every time I re-use a mapplet. That drives me bonkers. Set it once, and then you don't have to override it at the session level. I find that the less I have to override in the session, the happier I am.
Another area that can really bite you is the Source Qualifier's SQL override. This thing has caused me a ton of grief. The major thing to remember is that the column order of your SELECT has to match the order of the ports in the qualifier. So, if your SQL override selects A, C, B from your table, your ports damned well better be in that order; otherwise you're going to get B in your C column, and vice-versa. And if they're both the same datatype (or even remotely close), Informatica won't care. I find a good healthy dose of double-checking can alleviate testing headaches here. Generating the SQL inside the qualifier is also a good idea sometimes, just to see what you're producing. It's also a good idea to override the query if you don't need to process everything in your source. I mean, you could select every row from your table and then use a filter to ignore the rows you don't need - but that's not efficient. A better choice is to select only the rows you need to begin with.
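To make the port-order point concrete, here's a tiny hypothetical example (the column names are mine):
-- SQL override:
SELECT cust_id, last_name, first_name FROM cust_master
-- The qualifier's ports must be in the same order: cust_id, last_name, first_name.
-- If the ports are ordered cust_id, first_name, last_name instead, first names land
-- in the last_name column and vice-versa - and since both are strings, nothing errors out.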
Lastly, I know there are people reading the blog, but you guys have been quiet. Any particular areas of discussion you'd like me to cover?
Monday, May 11, 2009
Shadow IT and the chaos it creates
Shadow IT is a bad, bad thing. In case you're not familiar with the term, Shadow IT refers to people or even entire teams residing within the business community who have the means and the capability to create programs, procure software, or even purchase and manage IT and IT-related hardware.
One classic example that comes to mind is Wal-Mart, where the neglected Real Estate division, which hadn't gotten any love from IT, decided to take matters into its own hands. Imagine the CIO's surprise when, years later, they discovered an entire AS/400 machine and a small team of developers supporting it inside the Real Estate building!
Shadow IT is something we deal with in the business community where I work. It's not quite as bad as people going out and buying entire mainframe computers, but the damage is still felt nonetheless. Here, we've got people making bad decisions about data modeling. Data modeling is one of those things that smaller companies tend to overlook. Anyone can create a table - so why do you need data modelers when you've got DBAs? That's kind of like asking why you need engineers designing cars when you could just hire a bunch of mechanics to keep fixing all the old cars forever. That would work, right?
Bad idea.
So, one of the pieces of data I'm trying to load this week is travel agency agents. This is a table of licensed travel agents that we've contracted with - and for every transaction they divert to us, we cut them a small amount of money. This is in addition to the commission we pay to the travel agency itself. Of course, travel agent is a child of a travel agency. But there's a problem with this. In our data warehouse, travel agency is branded - which means our two different brands are licensed to a travel agency separately. So, the same travel agency could be licensed to both brands, and we store that travel agency twice, once for each brand. This isn't that big of a deal, except that travel agents (the people who actually work at these travel agencies) aren't branded. The very idea is ludicrous, so of course, when I'm bringing this data in, and I'm trying to correlate it to the parent record - I can't, because I don't know which brand to choose.
This all points back to the root of the issue, which is that at some point in time, someone - probably from the business side of the house - created a data model for travel agencies and decided to include "brand" as part of that table's primary key. Now we're all paying the price for that. Even more frustrating, the business has to hand-key everything in twice when they license a new travel agent, and you get data entry mistakes. So while the travel agency names match on 95% of all the pairs, 5% of them are different - now you've got data quality mistakes on top of a poor data model.
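To illustrate the shape of the problem, here's a simplified, hypothetical version of the model (not our actual DDL):
-- Branded parent: the same agency appears once per brand
CREATE TABLE travel_agency (
    agency_id   int         NOT NULL,
    brand_cd    char(1)     NOT NULL,
    agency_name varchar(70) NOT NULL,
    PRIMARY KEY (agency_id, brand_cd)
);
-- Unbranded child: there's no brand_cd here, so which parent row does it belong to?
CREATE TABLE travel_agent (
    agent_id    int NOT NULL PRIMARY KEY,
    agency_id   int NOT NULL
);
The child can't carry a clean foreign key to the parent, which is exactly the wall I hit.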
Never underestimate the power of data modeling.
Friday, May 8, 2009
My wife is how old?
When designing applications, I think it's important to put "sanity" checks into the front-end systems. I'll give you an example.
Years ago, I'd taken a new job and enrolled in the medical benefits program. As is common with these programs, they need your Social Security number and date of birth to validate membership.
My wife had a prescription to fill, and when we got to the pharmacy, I was in the system but my wife was not. The lady behind the counter told me that we needed to go back to HR to figure out what was wrong. But after a phone call with the provider, and some more sleuthing on the cashier's part, she finally figured out what had happened. Apparently, when I filled out the enrollment form for my wife, my "7" looked like a "9", and they misread my wife's birth year as "1991" instead of "1971". Since this was somewhere around the year 2000, that would have meant my wife was nine years old. I'm pretty sure even Arkansas doesn't allow that to happen.
That's a perfect example of a data quality sanity check. When the person keying that information in entered it, the system could have *beeped* and prompted a message that said "This spouse is less than 16 years old, is that correct?" (16 is probably a reasonable cutoff for the age of a spouse). Then the entry clerk could have taken a second look at the form instead of just blindly entering it, and probably would have stopped, fixed the issue, and gone on. Data quality problem avoided.
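The same kind of check is cheap to add on the ETL side, too, even if the front end never gets fixed. A sketch of a flag expression, with a made-up port name:
-- Flag implausible spouse birth dates for review instead of loading them blindly
IIF(DATE_DIFF(SYSDATE, SPOUSE_DOB, 'YYYY') < 16, 'REVIEW', 'OK')
It won't catch everything, but it would have caught a nine-year-old wife.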
On a related note, I had a friend in high school who married a 13 year old girl. They had to do it in Oklahoma, since Arkansas (where I lived at the time) wouldn't allow it. Crazy.
Wednesday, May 6, 2009
Neat trick in Informatica
I just did something by accident, and ended up learning a little trick.
You can easily make a source a target, and vice-versa.
Open the Target Designer workspace in Designer, then expand the Sources folder in the Navigator and drag a source into the target window.
BAM!
The source is now a target.
Reverse to make a target a source. Easy. That's a pretty nifty little trick.
This also works nicely if you want to copy code from development to production. Just open the Mapping Designer and drag the mapping from whatever folder you're copying from straight into it. This way, you don't have to scroooooll all the way down to where you want to place it. Much easier.
Updates bad, inserts good
Yesterday in a meeting, we had a fairly spirited debate regarding one of our legacy data warehouses. This legacy warehouse attempts to do some rudimentary match/consolidation on customer data using PL/SQL code. It works OK, but not great. There are a handful of tickets currently open on this environment pointing out flaws in the way it decided to match two apparently unrelated individuals.
That's a problem. Not a common one, mind you, but a problem nevertheless. We all agreed that our new data warehouse, which has a repository of customer data that's been matched and consolidated via Informatica's new Data Quality (IDQ) toolset, is far superior, and during testing we were able to show that IDQ was even better than the FirstLogic tool we had purchased previously (a tool designed to do matching and consolidation). Both are probably an order of magnitude better than anything pathetic little PL/SQL could ever hope to accomplish.
But one of the bigger issues that was discussed with regard to the legacy data warehouse was its use of Updates during the match process. If the process found a match, it would update the customer record with the new information. If the match was flawed, too bad. It just got replaced. Bad idea.
This discussion came up because, prior to this meeting, there was another meeting where it was decided that the match/consolidation should just be removed entirely from this warehouse. That just means that instead of any attempt to consolidate, every transaction would point to a unique instance of a customer, even if that customer existed in the database 20 other times. This might sound inefficient, but when we considered how flawed this matching process is, we decided the best place to do it was our new data warehouse, and anyone who wanted a matched source of customers should come to the new data warehouse. It was also good because some business clients don't want matched customer data - Auditing, Subrogation, Accounting, etc. all want the data exactly as it was at the time of the transaction, and if the customer data changed from one transaction to the next, they were only interested in what the customer looked like at the time of each individual transaction.
This would be a win for them, since they would then have access to the very transactional data they require.
It occurs to me now that there is a way to have your cake and eat it, too: Type II dimensions. A Type II dimension is fancy data warehouse speak for keeping history in a table. Basically, you add a date/time stamp to your primary key, and every time you get a change to a particular row, you put a new date/time stamp on it and insert it. Ideally, you want to be able to relate your fact table (or whatever else relates to your customer dimension) down to that single instance of a customer, without sacrificing anything. You can also maintain a cross-reference table which correlates all the original instances of a customer record (assuming you're getting customers from multiple sources [otherwise, what's the point??]), so you're not losing anything.
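A minimal sketch of what that dimension looks like - the table and column names are hypothetical, not our warehouse's:
-- Type II customer dimension: history is preserved by inserting new versions, never updating
CREATE TABLE dim_customer (
    customer_id   int      NOT NULL,  -- the business/source key
    effective_ts  datetime NOT NULL,  -- when this version of the customer became current
    first_name    varchar(40),
    last_name     varchar(40),
    email_addr    varchar(70),
    PRIMARY KEY (customer_id, effective_ts)
);
-- A change arrives for customer 42: insert a new version instead of updating in place
INSERT INTO dim_customer (customer_id, effective_ts, first_name, last_name, email_addr)
VALUES (42, GETDATE(), 'Joe', 'Smith', 'joe@yahoo.com');
Facts then join to whichever version was current at transaction time, which is exactly what Auditing and friends want.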
Updates bad. You should never update data in a data warehouse. Bad programmer. Bad.
No biscuit.
Tuesday, May 5, 2009
Varchar() vs. Char(); the debate rages on
I have this ongoing feud with SQL Server - no, not with Microsoft, and not with any specific person; with the database itself.
See, in general, I like the idea behind the varchar() data type, but there's something wholly unclean about it that bugs me. I think it's the variability that I don't care for. Whatever the reason, a while back I set out to make some "adjustments" to the database. I'd read that varchar() columns in SQL Server consume n+2 bytes of space, where n is the length of the specific data element being stored and 2 bytes are needed for varchar() overhead. Enter the brain surgeons that were the previous developers of the system I support. For some harebrained reason, we actually had varchar(1) columns in our database. Preposterous! For all intents and purposes, a varchar(1) column should take up a minimum of 2 bytes (for null values) and 3 bytes for non-null values. What a ridiculous waste of space!! And that's just the data modeler in me talking - imagine how much database space that consumes. :)
We also had varchar(10) for a field that was ALWAYS 4 bytes long - and varchar(70) for fields that never, ever, ever exceeded 20 bytes. My sense of perfection and harmony within the universe was being violated. I needed to do something about this.
So, quietly, I set about fixing this gross injustice. First, I changed the varchar(1) fields to char(1) fields. Next, I set my sights on those that needed to be char(4) - but then I got clever - "I know, I'll check the database properties for the table. That way, I can see how much space I'm saving!" - so I checked the size of the table before and after one of the field conversions - BIGGER!! What the heck!? "How can that be?!", I pondered, shaking my fist at the screen. "Everything I know is wrong!"
This is where the feud comes in. SQL Server is currently laughing at me, but I'm not done yet. I'll figure this mystery out eventually.
Meanwhile, Informatica decided to wreak havoc on my life. How? Well, apparently, there's a major difference in the way Informatica treats, say, a varchar(10) and a char(3). One of the fields I changed was a transaction type, and we only have two - and the biggest one is 3 bytes long, the other is 2. So, I changed the field to be char(3). Why not? Every field is populated (it's not null), so at worst I'm losing 1 byte.
Except that when Informatica pulls this data back from a lookup, when it was a varchar(10) before, it was handing me "XX" on a silver platter. Now, because it is char(3), I'm getting back "XX ". Only, this time instead of on a silver platter, it comes back with a side of "here's an extra space you didn't ask for, nor do you want". Kind of like putting nuts in moist, chewy brownies. Why do people do that?!
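The band-aid on my side is a quick trim wherever that lookup value gets compared or written - a one-liner, with a hypothetical port name:
-- char(3) comes back space-padded ('XX '), so trim it before comparing or loading
RTRIM(LKP_TRAN_TYPE_CD)
Not hard, just one more thing you have to remember to do everywhere that port shows up.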
So, take this to heart - if you decide to change a field from varchar() to char(), be aware that Informatica is going to handle that completely differently - in ways that you may not like.
Monday, May 4, 2009
Ones and Zeroes
I love IT stories. IT is such a small, close-knit community. It seems quite absurd, but it's fairly common to end up working alongside someone you worked with years before. Case in point, at my current client, there was another contractor (who isn't here anymore) who I knew from a previous job nearly 10 years before!! I also had a guy on my exact same team at two completely different jobs in two totally different cities.
So, it's probably not a good idea to piss off people you work with. Because it seems highly likely (despite how implausible it might seem) that you might end up crossing paths with them again. I believe this situation becomes even more prevalent when you start specializing in specific areas of IT. For example, I suspect the Teradata DBA community is probably a fairly close-knit group, and I have to suspect that given there are maybe a dozen companies in the country that utilize Ab Initio as their ETL tool of choice, that a great many core Ab Initio contractors probably interact with a fair amount of regularity. It's probably less of a problem for someone who just writes VB code, or C# code, as those people are a lot more common.
I had the pleasure of working alongside a guy named John when I got out of college. I'll call him John, because John is his real name. I was fresh out of college - not even 22 years old yet - and John was this sage fountain of infinite wisdom. He was in his 50s and had worked at a lot of places. John was planning on retiring at 55, and by all accounts, he probably is retired by now. I hope for his sake he is, at least. John was so tired of programming. He'd been doing it for so long. He used to joke with me by saying, "Let's see, if you play your cards right, maybe you could retire at 60. You're 21 now... Just think, maybe 40 more years of this." And then he would laugh at me. John had a knack for reminding me of that, usually during some stressful day when we'd been chewed out by our asshole of a manager, or when I was struggling to fix a bug in some program somewhere. One of the pieces of wisdom John left me with was the idea that "it's just a job". John himself had picked up this concept from someone else and imparted it unto me. It's nothing spectacular - just a way to put everything into perspective. Where he'd worked once, there was some threat of layoffs, and he'd gotten nervous, and a seemingly unconcerned co-worker tried to quell his fears: "It's just a job," they said. "If I lose this job, I'll leave, and then I'll get another job. It's just a job."
It's very true. Informatica people are in high demand, and even if I were to get laid off, I'd find another job. It's just a job. I enjoy what I do, and I could do it somewhere else just as easily.
I used to have this hangup about the "kind" of place I worked. My first job out of college was retail - and all the textbook examples are retail examples. You have items, and sales, and all that other retail stuff. And just when I thought I'd kind of figured it all out, I left and went to work for a telecommunications company. Holy cow, that was totally different. But somewhere along the line, I'd shunned looking at places like insurance or financial firms, since I'd kind of felt like those weren't as glamorous. Then at some point, once I'd gotten pretty good at my job at the telecommunications company, it struck me - it was irrelevant. This has continued to be true everywhere I've been, even though to date I've worked in retail, telecom, government, financial, briefly entertainment, and now travel. It's just data.
This is my corollary to the "It's just a job" concept. It's just data. As an Informatica programmer, I find myself caring less and less about what it is I'm moving from database to database. It's just data. It's all just 1s and 0s.
Friday, May 1, 2009
Informatica and the little things
I've always felt like it's the little things in life that matter. It's always the small things that either bite you in the ass, or - as they say, life's little pleasures. There doesn't seem to be any middle ground when it comes to little things. Take my 3 year old, for example. A couple of weeks ago, I asked her if Daddy was "bootylicious". She responded that she would not like to eat my booty. I can't blame her there. I wouldn't either.
Anyway, like butts, there are some things in Informatica that kind of stink, like worklets. I don't like worklets. I never have. I'm not even sure I know what the point is. Early on, when I took over the system we have, the previous developers had lumped every task together into logical groupings that they put into worklets. So, the workflow was literally one worklet after another. And each worklet was single-threaded. There was nothing running in parallel. And because it was all in worklets, you couldn't restart the workflow if it died. So, maintaining the thing was a nightmare.
One of the first things I did was to rebuild the entire workflow. I scrapped every worklet and put everything into one gigantic workflow. Oh sure, this was complicated, but it meant that I could load multiple things at a time, and do completely unrelated things in parallel. It also meant that if something died, the rest of the stuff would continue on until the dependencies came back together again, but I knew I could always "restart workflow from task" at that point. Life became good. I also shaved about 4 hours off the total run time. Good stuff.
Today, the only worklet I have at all is one which contains all the match/consolidate stuff we're doing with IDQ, and even then I only really use it in development. The Informatica rep we hired to code it all for us put it all into a worklet, and I frowned. "What about restartability?", I asked - and he agreed it was a problem, so he took it out of the worklet.
My advice? Stay away from worklets.
There are other little things like this in Informatica that can make life hard or easy. Take sequence generators, for example. The default number of cached values for a sequence generator is 1,000. Which means if you write 5 rows into a reference table a day, you're going to end up with key values like 1, 2, 3, 4, 5, 1000, 1001, 1002, 1003, 1004, 2001, 2002, 2003, 2004... something like that. Kind of ugly. Set it to 10 or 100 instead. Likewise, if you're writing a hundred thousand rows into a table a day, a cache value of 1,000 is inefficient. Might as well make it 10,000. That's 90 fewer generations.
Little things.
Another little thing that new developers often overlook is the "save session log for these runs" option under the "config object" tab of a session. I always set this to 14 - that saves two weeks of logs. Leaving it at zero (the default) means you'll only ever store one version of your log - the current one. Bad idea. Better yet, edit the default_session_config (under Tasks -> Session Configuration) and change it permanently there, so you don't have to think about it (or worse, forget about it) ever again.
Labels: Informatica, logs, sequence generator, worklets