Thursday, April 30, 2009
It's hard. So very, very hard. In a perfect world, all the source systems would have universal keys, adhere to a global conceptual model, and have perfect data quality. HA! If only we lived in a perfect world.
We had a minor victory at work this week.
We have three (count 'em - three) sources of location data. This is store data. Three sources. One is considered fairly primary, another exists for reporting purposes, and a third exists in one of the three data warehouses that we have. The data warehouse I work on pulls data from the two non-data-warehouse sources. And ... of course, they're slightly different. So, we have a lot of code devoted to figuring out which one is right and which one gets higher preference. Understanding our location table (which, in all fairness, is really a fourth source) takes a fair bit of knowledge, since it contains data from two separate sources. It's kind of a mess.
Made all the more complicated by the fact that sometimes we receive transactions which aren't tied to any valid location at all. Some of our licensee locations have the ability to just put whatever location ID they want to in the transmission files. Sometimes those store managers want to report on separate counter agents, so they'll actually assign each person in the store a unique store ID!! Put that in your pipe and smoke it, data quality!
Crazy.
Anyway, back to the minor victory. We finally (we being the data management team, in conjunction with certain clients) agreed that we could automate the process of bringing the primary source of location data into the data warehouse. This isn't location data so much as hierarchy data. But, close enough. This is a big win because - get this - the current process is: an email gets sent out notifying downstream systems of a change in the location hierarchy, someone in production support creates a ticket, and then updates the table manually. Whoa. What a mess.
As you can imagine, the table was more than a little different from the original source, but everyone involved agreed that the original source was "more better", so we decided to proceed with implementing the code that I wrote a couple of months ago. The code was tricky*, but I got it to work. When it came time to implement it, though, I was like "Wow, this is a pretty big change. We may need to discuss this one first." Once everyone finally got on board, we could move forward with that code.
Minor victory. Every little bit helps.
* I had to read from a de-normalized Oracle table which was literally modeled after a COBOL flat file. So, I had to recurse through the table in order to build out the hierarchies. It was kind of ugly.
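If you're curious, the gist of it looks something like this - a minimal sketch in Python rather than the actual Informatica/Oracle implementation, with made-up rows and column layout:

```python
# Minimal sketch: build full hierarchy paths from a denormalized
# parent/child location table by recursing from the roots down.
# The rows and columns are hypothetical - the real data came from
# an Oracle table modeled after a COBOL flat file.

rows = [
    # (location_id, parent_id, name) - in reality, fetched from Oracle
    (1, None, "Corporate"),
    (2, 1, "West Region"),
    (3, 2, "Store 1138"),
    (4, 1, "East Region"),
]

# Index children by their parent so we can walk the tree.
children = {}
for loc_id, parent_id, name in rows:
    children.setdefault(parent_id, []).append((loc_id, name))

def build_paths(parent_id=None, path=()):
    """Recurse through parent/child pairs, yielding each node's full path."""
    for loc_id, name in children.get(parent_id, []):
        full_path = path + (name,)
        yield loc_id, " > ".join(full_path)
        yield from build_paths(loc_id, full_path)

for loc_id, hierarchy in build_paths():
    print(loc_id, hierarchy)
```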
Wednesday, April 29, 2009
Informatica & Directory structures
This is kind of a big topic, but I felt like it was important to talk about.
For the purposes of this discussion, I'm going to use the word "folder" to describe a logical placeholder within Informatica itself, and "directory" to describe a path on a hard drive within an operating system.
In other words, mappings are stored in folders, and files are stored in directories. Got it? Good.
When you create a new folder using the Repository Manager in Informatica, it just creates the folder within Informatica - no actual directories are created. Which means if you're writing a logfile somewhere, it's just going to get written to some default location. That probably varies based on installation, but in the environment I use, it's:
E:\Informatica\Powercenter8.6.1\server\infa_shared
Now, this isn't terribly useful, and it's quite cluttered. Especially if you have hundreds of sessions writing logs into the same directory. Not pretty. But that might not be horrible if you just had logs - but Informatica stores all kinds of files there - input files, bad files, target files, you name it.
Fortunately, within the above path, Informatica creates a default cache directory (.\cache under the directory named above, for example). This is handy, but still not ideal.
A better strategy is to create a specific directory for every folder within Informatica. This keeps things organized. But you have to go one step further, and you actually have to modify your workflow and session properties to write to these specific directories. Not that big of a deal. You just have to be diligent with it.
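For what it's worth, most of these locations map onto PowerCenter's built-in service variables, so you can repoint those per session rather than hard-coding paths everywhere. The names below are how I remember them in 8.6.1 - treat the exact list as an assumption and verify against your install:

```
$PMSessionLogDir   -> <folder directory>\SessLogs\
$PMWorkflowLogDir  -> <folder directory>\WorkflowLogs\
$PMCacheDir        -> <folder directory>\Cache\
$PMBadFileDir      -> <folder directory>\BadFiles\
$PMSourceFileDir   -> <folder directory>\SrcFiles\
$PMTargetFileDir   -> <folder directory>\TgtFiles\
$PMTempDir         -> <folder directory>\Temp\
```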
We use the following sub-directories within the root directory we create for each folder. So, if we have a folder called "Sales", we'll create a .\Informatica\Sales\ directory (where you put this is up to you), and then sub-directories below that (a small script for stamping these out follows the list). We use:
Backup\
BadFiles\
Cache\
Docs\
Export\
ExtProc\
Headers\
Parameter\
Scripts\
SessLogs\
Sql\
SrcFiles\
Temp\
TgtFiles\
WorkflowLogs\
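Creating all of these by hand for every new folder gets old fast. Here's a minimal sketch of a setup script, assuming the directory list above; the root path in the example is hypothetical:

```python
import os

# The standard sub-directories we create for each Informatica folder
# (same list as above).
SUBDIRS = [
    "Backup", "BadFiles", "Cache", "Docs", "Export", "ExtProc",
    "Headers", "Parameter", "Scripts", "SessLogs", "Sql",
    "SrcFiles", "Temp", "TgtFiles", "WorkflowLogs",
]

def create_folder_dirs(root, folder_name):
    """Create the standard directory tree for a new Informatica folder."""
    base = os.path.join(root, folder_name)
    for sub in SUBDIRS:
        os.makedirs(os.path.join(base, sub), exist_ok=True)

# Example (path is made up): create_folder_dirs(r"E:\Informatica", "Sales")
```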
Most of these were defined before I started here, so I'm not sure of the reasoning behind a few of them. I'm sure the previous admin's intentions were noble, but in reality, we only really use a few of these:
Cache\, SessLogs\ and WorkflowLogs\ are directories every workflow is going to need.
BadFiles\ is only needed if you're writing out a reject file.
Backup\ would be a handy place to store data files, but I'm in favor of keeping data files in a completely separate location - which also calls the value of SrcFiles\ and TgtFiles\ into question. Honestly, I only ever use those when I'm testing, so they do come in kind of handy.
Docs\ is questionable. I can't imagine anyone ever looking for documentation here; storing it online is a far better option.
Export\ would be a good choice if you ever export your mappings (perhaps as a backup?). Of course, this makes Backup\ completely redundant.
ExtProc\ - I have no idea what this was supposed to be for.
Headers\ is useful if you're writing to files that need headers. That's where I store my headers.
Parameter\ is handy if your workflow uses a parameter file.
Scripts\ and Sql\ are both questionable. I don't know why an Informatica mapping would need a sql file, and any kornshell or batch scripts called by the workflow are probably better off stored somewhere else.
Temp\ is kind of useless. Informatica already has a Cache\ directory, so Temp\ is redundant.
That gives us the following useful directories:
BadFiles\ (optional, perhaps)
Cache\
Export\ (optional)
Headers\ (optional)
Parameter\ (optional)
SessLogs\
WorkflowLogs\
If you insist on having SrcFiles\ and TgtFiles\ directories, I'd recommend a single location shared by all workflows. That way, when your drive fills up because someone decided to test a 14 GB input file, you can find the culprit easily, without having to scan through a bunch of subdirectories looking for candidate files to remove. This also makes it a little easier for workflows to share a common data pool - if two workflows in different folders read from the same input, for example.
Lastly, I would be remiss if I didn't state the obvious, and that is the drive or mount hosting all these directories should be large - large enough for all those Cache\ directories. Alternatively, you could create a special mount or drive to just hold Cache\ files. If that place is too small, and fills up, your mapping will die while creating lookup files. That's bad.
Also, session logs can become VERY large if they produce a lot of errors, or if verbose logging is turned on. So, be sure to clean those up, if they're not needed.
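A scheduled cleanup along these lines does the trick - a minimal sketch, with the retention period and example path as placeholders:

```python
import os
import time

RETENTION_DAYS = 30  # placeholder - keep logs as long as your shop requires
cutoff = time.time() - RETENTION_DAYS * 86400

def purge_old_logs(log_dir):
    """Delete session/workflow log files older than the retention period."""
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)

# Example (path is made up): purge_old_logs(r"E:\Informatica\Sales\SessLogs")
```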
Tuesday, April 28, 2009
Appending a header onto a flat file in Informatica
A few weeks ago, I had a project where I needed to append a header onto a flat file using Informatica. I'd come across a few posts regarding this topic, but most of them were people asking how to do it - and struggling to find a solution.
I don't know what the "preferred" solution is within Informatica, and I'm certain that there are probably multiple "correct" ways in which this can be accomplished (some of them clunkier than others, I'm sure), but the following is a way in which I got it to work, and it's pretty elegant.
Also, you should know that I tried about 20 variations on this before I finally got it to work. So, if you're struggling with this, just try this method. A lot of other ways you think might work, won't actually work. So, this might save you some time and headache.
Also, keep in mind I'm using 8.6.1, so the options available in your version might differ slightly if yours is significantly different.
Ok, brace yourself, this is pretty complicated...
First, write to a flat file target in your mapping. That much should be obvious.
- Create a .txt file that has your header in it. Be sure to put a carriage return at the end, otherwise your data will append to the end of that line. And be sure not to have empty rows that follow, otherwise you'll have a blank line between your header and your data.
- Go to the Mapping tab of your Session, and click on your flat file target.
- For "Header Options" Choose "Use header command output"
- For the "Header Command" put "type " and then a path where your header file is located.
- Um.. that's it.
But anyway - that's it. Piece of cake.
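If it helps to see what's going on, the effect is the same as prepending the header file's contents to the target file. A rough Python emulation, with hypothetical paths:

```python
# Rough emulation of the "type <header file>" header command: the
# header file's contents land at the top of the target file.
# Both paths below are made up for illustration.

header_path = r"E:\Informatica\Sales\Headers\sales_header.txt"
target_path = r"E:\Informatica\Sales\TgtFiles\sales_extract.txt"

with open(header_path) as f:
    header = f.read()  # must already end with a line break

with open(target_path) as f:
    data = f.read()

with open(target_path, "w") as f:
    f.write(header + data)
```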
Monday, April 27, 2009
At work, we use Informatica. In fact, my team is on the bleeding edge of Informatica - we're using 8.6.1. Which is so new, it didn't even work right when the vendor tried to install it (it wasn't compatible with the Data Quality piece), and they had to reconfigure it (they being Informatica technical support) in order to get it to work. That's how on the bleeding edge we are. You can't get much more bleeding edge than that.
Since Informatica will be a topic du jour for this blog, I'll start out by giving a few Informatica tips. First off:
Don't believe that Informatica can do everything. Quite simply - it cannot. For example, Informatica is exceedingly bad at deleting data. Really, really quite bad, in fact. In a word - atrocious. Don't do it. At work, we have a workflow that runs "daily" which attempts to purge old data. At last count, said job had been running for 9 days. That means, on the odd chance that it actually finishes, when it starts up again, it's going to be 9 days behind.
A co-worker once gave me a bit of wisdom. He said "A daily job should finish in under 24 hours." He was a smart guy.
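If you can, push the purge down to the database instead. A batched delete tends to behave far better than deleting row by row through an ETL tool. Here's a minimal sketch of the idea against Oracle - the table, column, retention window, and connection details are all made up:

```python
import cx_Oracle  # assuming an Oracle target, as in our shop

# Purge old rows in batches so a single giant transaction doesn't
# blow out undo space. All names below are hypothetical.
conn = cx_Oracle.connect("dw_user", "dw_password", "dw_host/dw_service")
cursor = conn.cursor()

while True:
    cursor.execute(
        """DELETE FROM sales_fact
            WHERE load_date < TRUNC(SYSDATE) - 90
              AND ROWNUM <= 10000"""
    )
    deleted = cursor.rowcount
    conn.commit()
    if deleted == 0:
        break

cursor.close()
conn.close()
```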
Informatica also isn't great at dealing with file manipulation. It's quite clumsy in that space. You're better off using DOS batch files (I cringe every time I say that) or (if you're luckier than I am) UNIX kornshell scripts (or whatever flavor of script you prefer). Much better option, and much more flexible and powerful.
Struggling with a problem in Informatica? Drop me a line - I won't guarantee that I know the answer, but I'll give it the ol' college try - which means I'll just steal someone else's answer from the internet.
Until next time...
Welcome to my blog. My intention with this blog is to remain focused on topics of interest to the IT community, specifically as it pertains to Data Warehousing, which is my particular area of expertise (look, it's so important, I even capitalized it!). And, I like stories - specifically IT stories. So, I may share a funny story or anecdote from time to time. And if you have an IT-related story, please share it. I welcome your feedback.