A client of mine posed this interesting dilemma to me the other day.
"I have a project where I need to build a simple but reliable text file importer to drop records into [a table]. But it needs to be configurable so that the client can map where the fields go."
At first, I was stumped - no, I don't think Informatica can do that. That's very much outside the boundaries of what Informatica is designed to do. Obviously, Informatica works best when things are well defined. It's designed to be a data forklift, after all, and forklifts work best when things reside on pallets. Forklifts tend to fail if you use them to move a stack of beanbags.
But the idea intrigued me. And, coincidentally, a coworker of mine faced a similar issue: a client kept sending ever-changing XML definitions for his source, so he needed a similar solution. A proof of concept was in order.
I created the following table:
create table test_random (
pk int not null,
v1 int,
v2 int,
v3 int,
v4 int,
v5 int,
v6 int,
v7 int,
v8 int,
v9 int,
v10 int )
This is just a generic target table with 10 placeholder value fields, and one primary key column.
And created the following input data file:
key|v3|v9|v2|
1|2|3|4|
2|2|3|4|
3|2|3|4|
4|2|3|4|
5|2|3|4|
6|0|100|42
The file header here tells me where the data goes. For the purpose of the POC, I'm only dealing with inserts, but since a primary key is supplied in my data file, I could easily do a lookup, and process updates as needed. That solution is well known, however, so I didn't bother.
Output:
select * from test_random
pk v1 v2 v3 v4 v5 v6 v7 v8 v9 v10
----------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- -----------
1 NULL 4 2 NULL NULL NULL NULL NULL 3 NULL
2 NULL 4 2 NULL NULL NULL NULL NULL 3 NULL
3 NULL 4 2 NULL NULL NULL NULL NULL 3 NULL
4 NULL 4 2 NULL NULL NULL NULL NULL 3 NULL
5 NULL 4 2 NULL NULL NULL NULL NULL 3 NULL
6 NULL 42 0 NULL NULL NULL NULL NULL 100 NULL
(6 row(s) affected)
Success! But how did I do it?
First off, I created two source qualifiers for the same source - my data file. My intention here is to grab the header with one source and the detail records with the other, so I created two separate source definitions. For the detail source, I told it to skip the first row. Easy enough. But there is no easy way in Informatica to do the reverse - read only the first n rows. To do this, I added a Sequence Generator transformation to that pipeline and checked the "Reset" option, which resets the sequence back to its start value for every session run. The Reset option isn't available on reusable sequence generators, so don't make this one reusable.
Next, I created a filter that passes only rows where the NEXTVAL from the sequence generator is less than or equal to 1. This means that only the very first row I read - the header - makes it through my filter.
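For reference, the filter condition itself is tiny. Something like the following should do it, assuming NEXTVAL is the port linked over from the Sequence Generator and that the generator starts counting at 1:

-- Filter condition on the header pipeline; keeps only the first row read
NEXTVAL <= 1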
Next, I join the data together. What I actually want here is a full Cartesian join. When you Cartesian-join two things together, you get x*y rows as a result, where x and y are the number of rows in your two sources. So, if I had 5 rows in one source and 100 in another, I'd end up with 500 rows. In my case, since I only have 1 header row, I end up with 1*n rows of data. Only problem is, the Joiner transformation doesn't support Cartesian joins - inner, left, right and full, certainly - but not Cartesian. What kind of idiot would want that?!
I do! To do it, I created a static value of "1" in both pipelines and used that as my join criterion. It's simply a bogus column that matches every time. Voila - Cartesian join.
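Here is roughly what that looks like. The port names below are my own invention - use whatever you like; all that matters is that both pipelines carry the same constant and that the Joiner condition compares the two constant ports:

-- In an Expression transformation in EACH pipeline, add an integer output
-- port (say, JOIN_KEY_HDR on the header side and JOIN_KEY_DTL on the detail
-- side) whose expression is simply the constant:
1

-- Then make the Joiner condition compare the two constants, so every
-- header row pairs with every detail row:
JOIN_KEY_HDR = JOIN_KEY_DTL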
Now that I have my header values ("v3", "v9", "v2", and so on) associated with every detail record in my file, I can apply the bulk of my logic.
I create one port for every column in my target. Each of these is an output-only port containing the following code:
IIF(rtrim(header1) = 'v1', to_integer(detail1),
    IIF(rtrim(header2) = 'v1', to_integer(detail2),
        IIF(rtrim(header3) = 'v1', to_integer(detail3), NULL)))
If the value of a header field matches the column name associated with the port this expression feeds, then grab the value from the corresponding detail field. Obviously, the more input columns you have, the more nesting this logic requires; the block of code (with the static column name being the only difference each time) gets deeper and deeper as columns are added. But we have to evaluate every source column, because the value for column 1 in my table could be in the 50th column of my source file. I must check every column until I find one that matches.
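For example, the expression for the port that feeds target column v2 is identical except for the literal being compared (header1/detail1 and friends are the header and detail ports coming out of the Cartesian join, as above):

IIF(rtrim(header1) = 'v2', to_integer(detail1),
    IIF(rtrim(header2) = 'v2', to_integer(detail2),
        IIF(rtrim(header3) = 'v2', to_integer(detail3), NULL)))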
Friday, September 10, 2010