Wednesday, August 14, 2013

Vertica's hidden tables (and views!)

Wow, hello again blogosphere. Time for a shift in thinking here. Since I work for Vertica now, I'm going to use this space (hopefully regularly) to blog about all things Big Data and Vertica-related. So, let's dig right in with an interesting (to me!) Vertica-related topic around system tables.
 
As you're probably aware, there is a table in Vertica called system_tables. It contains entries for 114 tables (in 6.1.2). One of those entries is a table called data_collector, which has 149 rows. Each of those rows describes a table in the "dc" schema, part of the "data collector" set of tables used by the Management Console. These tables aren't documented and could (and do!) change between releases. That being said, you can find some interesting stuff in them.
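A quick way to see what's available is to browse data_collector itself. Here's a sketch (the table_name and description columns are there in 6.1.x, but like everything else in this area, they may change between releases):

SELECT table_name, description
FROM data_collector
ORDER BY table_name
LIMIT 5 ;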
 
But last, and certainly not least, is another undocumented table called "vs_system_tables". I'm going to go out on a limb here and speculate that the "vs" stands for "Vertica System". Other candidates include "Very Secretive" or "Veritable Smorgasbord".  The jury is still deliberating.
 
vs_system_tables, like system_tables, contains descriptions of ALL tables, hidden or otherwise. That means it also includes all of the data_collector tables and the system_tables entries. So, if you want a unique count for each group, you'd do something like the following:


SELECT COUNT(*) FROM system_tables ;
 COUNT
-------
   114
(1 row)

SELECT COUNT(*) FROM data_collector ;
 COUNT
-------
   149
(1 row)

SELECT COUNT(*) FROM vs_system_tables
WHERE table_name NOT ILIKE 'dc_%'
AND table_name NOT IN (SELECT table_name FROM system_tables);
 count
-------
   116
(1 row)
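(You could also fold all three into a single count; a sketch that assumes each of these tables exposes a table_name column, which they do in 6.1.2:)

SELECT COUNT(*) FROM (
    SELECT table_name FROM system_tables
    UNION ALL
    SELECT table_name FROM data_collector
    UNION ALL
    SELECT table_name FROM vs_system_tables
    WHERE table_name NOT ILIKE 'dc_%'
    AND table_name NOT IN (SELECT table_name FROM system_tables)
) t ;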

That gives you a total of 379 tables.  But not so fast - not all of these are actually tables. Several are actually views. Which ones, you ask?  Well, the ones defined in the vs_system_views table.  Of course.

select count(*) from vs_system_views ;
 count
-------
   100
(1 row)

So, out of the 379 "tables" we've identified, 100 are actually views. Which raises the question - can we change them?  Most definitely.

Let's take a look at the "columns" view.

select data_type, data_type_length, character_maximum_length from columns limit 5 ;
  data_type  | data_type_length | character_maximum_length
-------------+------------------+--------------------------
 varchar(20) |               20 |                       20
 varchar(20) |               20 |                       20
 int         |                8 |                        
 int         |                8 |                        
 float       |                8 |                        
(5 rows)

Here, we can see that the columns view presents us with two different fields (among others) that define the length of the data type. If the column is a character type, we get a second column that defines its maximum length. This is kind of redundant. So, I'd like to modify the view to just show me the "length" of the column, regardless of whether it's a character type or not. We could do this by modifying the view DDL stored in the vs_system_views table.

select * from vs_system_views where view_name = 'columns';

For brevity's sake, I'm not going to reproduce the results of that query here. But you can see that the table gives me the schema, and view_name, as well as a description and the "create view" DDL statement. But here are the columns in question:

    vcs."attlen" as "data_type_length",
    case when vcs.atttypid in (8, 9, 17, 117) and vcs.atttypmod <> -1 then vcs.attlen else null end as "character_maximum_length",

We can see here that "attlen" is being used in two different locations. I want to simplify this to just include it once and be done with it. So, we can eliminate the more complex second line and just keep the "attlen" as "data_type_length" column. I'm going to go one step further and simply name it "length".

With any view, you can run the query you've got without the "create view" portion of it to see if it gives you the results you want. If you like what you see, then build your view. For our test, we're going to give it a different name. In this case, "columns2".

create view "v_catalog"."columns2" as   select vcs."t_oid" as "table_id",
    vcs."nspname" as "table_schema",
    vcs."relname" as "table_name",
    vcs."systab" as "is_system_table",
    vcs."t_oid" || '-' || vcs."attnum" as column_id,
    vcs."attname" as "column_name",
    vcs."typname" as "data_type",
    vcs."atttypid" as "data_type_id",
    vcs."attlen" as "length",
    case when vcs.atttypid = 16 and vcs.atttypmod <> -1 then ((vcs.atttypmod-4)>>16) else null end as "numeric_precision",
    case when vcs.atttypid = 16 and vcs.atttypmod <> -1 then ((vcs.atttypmod-4) & 1023) else null end as "numeric_scale",
    case when vcs.atttypid = 12 and vcs.atttypmod <> -1 then vcs.atttypmod else null end as "datetime_precision",
    case when vcs.atttypid = 14 and vcs.atttypmod <> -1 then (vcs.atttypmod & 65535) else null end "interval_precision",
    vcs."attnum" as "ordinal_position",
    not vcs."attnotnull" as "is_nullable",
    case when vcs."attisidentity" then null else vcs."adsrc" end as "column_default",
    vcs."attisidentity" as "is_identity"
  from "v_internal"."vs_columns" as vcs
  where vcs."nspname" <> 'v_internal' and vcs."nspname" <> 'v_monitor' and vcs."nspname" <> 'v_catalog';

ROLLBACK 5447:  View columns2 cannot be created under system schema v_catalog

Whoa! It certainly didn't like that. We can't create (or modify) views in the system schemas, no matter how much we beg or plead. We'll just have to create it as a regular, plain ol' view in a schema of our own.

create view "public"."my_columns" as   select vcs."t_oid" as "table_id",
    vcs."nspname" as "table_schema",
... etc.


That works.  In theory, I could create "my" own versions of many of the system tables, if I'm inclined to make minor changes to them.
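And the copy can be queried just like the original. A quick sanity check against the my_columns view we just built:

select table_schema, table_name, column_name, data_type, length
from public.my_columns
limit 5 ;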

Enjoy!
 

 

Friday, September 10, 2010

Random input Informatica

A client of mine posed this interesting dilemma to me the other day.

"I have a project where I need to build a simple but reliable text file importer to drop records into [a table]. But it needs to be configurable so that the client can map where the fields go."

At first, I was stumped - no, I didn't think Informatica could do that. It's very much outside the boundaries of what Informatica is designed to do. Informatica works best when you have well-defined structures. It's designed to be a data forklift, after all, and forklifts work best when things sit on pallets. They tend to fail when you use them to move a stack of beanbags.

But the idea intrigued me. And coincidentally, a coworker of mine faced a similar issue: a client kept sending ever-changing XML definitions for his source, so he needed a similar solution. A proof of concept was in order.

I created the following table:
create table test_random (
pk int not null,
v1 int,
v2 int,
v3 int,
v4 int,
v5 int,
v6 int,
v7 int,
v8 int,
v9 int,
v10 int )

This is just a generic target table with 10 placeholder value fields, and one primary key column.

And created the following input data file:
key|v3|v9|v2|
1|2|3|4|
2|2|3|4|
3|2|3|4|
4|2|3|4|
5|2|3|4|
6|0|100|42

The file header here tells me where the data goes. For the purpose of the POC, I'm only dealing with inserts, but since a primary key is supplied in my data file, I could easily do a lookup, and process updates as needed. That solution is well known, however, so I didn't bother.

Output:
select * from test_random
pk v1 v2 v3 v4 v5 v6 v7 v8 v9 v10
----------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- -----------
1 NULL 4 2 NULL NULL NULL NULL NULL 3 NULL
2 NULL 4 2 NULL NULL NULL NULL NULL 3 NULL
3 NULL 4 2 NULL NULL NULL NULL NULL 3 NULL
4 NULL 4 2 NULL NULL NULL NULL NULL 3 NULL
5 NULL 4 2 NULL NULL NULL NULL NULL 3 NULL
6 NULL 42 0 NULL NULL NULL NULL NULL 100 NULL

(6 row(s) affected)

Success! But how did I do it?

First off, I created two Source Qualifiers for the same source - my data file - with two separate source definitions. My intention is to grab the header with one and the detail records with the other. For the detail source, I told it to skip the first row. Easy enough. But there is no easy way in Informatica to do the reverse - read only the first n rows. To handle this, I added a Sequence Generator transformation to the header pipeline and checked the "Reset" option, which resets the sequence to its start value for every session run. The Reset option isn't available on reusable Sequence Generators, so don't make this one reusable.

Next, I created a Filter in that pipeline that passes only rows where the Sequence Generator's NEXTVAL is <= 1. This means that only the very first row I read passes through the filter.

Next, I join the data together. What I actually want here is a Cartesian join. When you take the Cartesian product of two sources, you get x*y rows as a result, where x and y are the number of rows in each source. So, if I had 5 rows in one and 100 in the other, I'd end up with 500 rows. In my case, since I only have 1 header row, I end up with 1*n rows of data. The only problem is, the Joiner transformation doesn't support Cartesian joins - inner, left, right and full, certainly - but not Cartesian. What kind of idiot would want that?!

I do. To do it, I created a static value of "1" in both pipelines and used that as my join condition. It's simply a bogus column that matches every time. Voila - Cartesian join.
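If you're more comfortable in SQL, the trick is the same as joining on a constant. Roughly this (the table and column names are illustrative, not from the actual mapping):

select h.header1, h.header2, h.header3,
       d.detail1, d.detail2, d.detail3
from detail_rows d
join header_row h
  on 1 = 1 ;  -- constant condition: every detail row pairs with the lone header row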

Now that I have my header values ("v3", "v9", "v2", etc.) associated with every detail record in my file, I can apply the bulk of my logic.

Next, I create a dedicated port for every column in my target.

Each of these ports is an output-only port containing the following code:

IIF(rtrim(header1) = 'v1', to_integer(detail1),
IIF(rtrim(header2) = 'v1', to_integer(detail2),
IIF(rtrim(header3) = 'v1', to_integer(detail3), NULL)))

If the value of a header port matches the column name associated with the target port, then grab the corresponding detail port's value. Obviously, the more input columns you have, the more nesting this logic requires; the block of code (with the literal column name being the only difference each time) gets deeper and deeper as columns are added. But we have to evaluate every source column, because the value for column 1 in my table could be in the 50th column of my source file. I must check every column until I find one that matches.

Wednesday, May 5, 2010

Adding attachments to vmailer

I use Informatica on a Windows platform, which means I've become pretty good at writing DOS Batch scripts. I'm not a huge fan of this, mind you, but I do what I have to in order to solve the problems I face.

One such problem involves sending an email for error notification from within a Batch file. You know - you've got a Batch script which SFTPs to a remote server to pick up a file from a vendor, the file doesn't exist (or is zero bytes in size), and your job happily processes nothing. Bad idea. So, you need some sort of validation so you know to intervene before the client angrily calls you.

One program I've found that I like a great deal is called vmailer.exe. It's a freeware, command line email utility. It's about as simple a program as you'll find (you get what you pay for), but for what I need, it works perfectly.

Except that the documentation is extremely limited, and the website's own FAQ is particularly useless.

By default, vmailer works as follows:
c:\vmailer.exe

V-Mailer v1.2.1 (www.virdi-software.com)

Usage: VMailer.exe Filename SmtpServer To From User Password

where:
FileName = text file containing the message
SmtpServer = server name or ip address e.g. smtp.domain.com
To = recipient's address e.g. him@domain.com
From = sender's address e.g. me@domain.com
User = (optional) user name if authentication required
Password = (optional) password if authentication required

Most of this is pretty self-explanatory. I've never used the user/password at the end, so that leaves just four options. The second one might require an email to your sysadmin to ask what the Exchange server is (if you know where to look, you can also dig this information out of Outlook). The To and From are the email addresses, and I'm pretty sure you can put whatever you want in the From line.

That leaves the "filename". In its simplest form, you could put something like this in that file:
"Hello world"

That would show up in the body of your message, but your Subject line would be blank. The vmailer FAQ is helpful in solving this particular problem, however, and suggests that you need to add some formatting to the message file in order to get a subject line in there. In other words, there is some intelligence in this file; it's not merely the contents of your email body.

That gives us something a bit more complex:
From: Curtis
To: Valued Recipients
Subject: This is a Subject line.

This is a test.

That's better! Now we get a subject line. And a bit more power over the "To" and "From" which you can play with if such things amuse you. (they do me)

But most recently, I wanted to be able to add an attachment to the message. This proved harder, and the web proved particularly unkind to me in solving this dilemma. I did finally find a good site which revealed a simple solution, but I had to dig and dig for it, which is why I decided to put it here for my own sake, if not for the sake of others trying to find similar information.

MIME-Version: 1.0
Content-Type: text/plain ; name="test_20100505.csv"
Content-Transfer-Encoding: Base64
Content-Disposition: attachment; filename="test_20100505.csv"
From: Curtis
To: Valued Recipients
Subject: This is an Attachment Test

This is a test.

Now, I'm not going to sit here and pretend that I know what all of that stuff means exactly, but I do know that there are multiple content-types, and if you google "MIME content-types" you can get a list of all the types out there that a MIME email accepts.

Obviously, this is a bit outside the scope of Vmailer, since at this point we're talking about MIME, and no longer about vmailer. But if you're like me, you enjoy simple solutions. :)

Thursday, March 18, 2010

Badfiles

Whenever rows get rejected from an insert/update/whatever, they get written to the badfiles. This is kind of like "time out" for naughty records. Unlike naughty children, however, you can choose to completely ignore these records for all eternity. That might not be what you want to do, however.

Assuming you care what's in that file, you'll need to read it. Thing is, it's kind of a convoluted mess upon close examination. Informatica dumps your record, plus a bunch of other stuff, into a comma-delimited flat file. You can read this file like any other source, except its layout isn't the same as your failed session's target definition.

In fact, it looks a little something like this:
0,D,26,D,,N,138,D,,N,,N,,N,,N,7,D,,N,,N,1,D....

The first column is a row indicator (smallint). It contains values from 0 to 9, the most common being 0 or 1, which means the row that failed was an insert or an update, respectively.

Then there is a column indicator. The first one is always a "D", which means the data is valid.

Then you get every column that was written to your failed target, each followed by its own column indicator.

So, if you tried to write to a table with 3 columns A, B and C, and failed on a unique constraint violation, your bad file will look like this:

0,D,A,D,B,D,C,D

Here, all the "D" values are column indicators showing that the columns contained valid data, and the 0 indicates the row was an insert. If a column was null, you'll see two commas (,,) followed by an "N" for null - so the same row with a null in column B would look like 0,D,A,D,,N,C,D.

For more information on badfiles, consult Informatica's excellent help. This topic is listed under "reject files" - you'll have a hard time searching for it under "badfiles".

Thursday, February 18, 2010

Convert UNIX epoch time in Informatica

Here's a handy expression for converting UNIX epoch time in Informatica. UNIX epoch time is an int(11) field containing the number of seconds elapsed since midnight on 1/1/1970.

To convert to a GMT date in Informatica, I created a mapplet. The mapplet receives the EPOCH_TIME as a variable.

IIF(EPOCH_TIME > 0,
ADD_TO_DATE(to_date('19700101', 'yyyymmdd'),'SS',EPOCH_TIME), NULL)

The original version I created lacked the IIF test, so in cases where the epoch time was 0, I was getting 1/1/1970 as my result, which I didn't want.
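For reference, the same conversion in plain SQL looks roughly like this (SQL Server syntax; the table and column names are illustrative):

select case when epoch_time > 0
            then dateadd(second, epoch_time, '1970-01-01')
            else null
       end as gmt_date
from my_source_table ;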

Enjoy.

Tuesday, December 15, 2009

Informatica for free

Saw this today, and thought I'd share.


If you are interested in installing Informatica PowerCenter (server + client) on your home PC or laptop, here is how you can do it. The copy comes with a Development License (http://www.oracle.com/technology/software/popup-license/standard-license.html) that allows you to use the full version of the product at no charge, strictly for self-educational purposes.

Go to http://edelivery.oracle.com/

Fill in the form on the next page.

On the Media Pack Search page, search for "Oracle Business Intelligence"

Download the version that you want to install.

Notes:
You may also want to download an Oracle database.
These pages will help you in the installation process:

http://download.oracle.com/docs/cd/E10783_01/doc/bi.79/e10742/anyinstadmwininstall.htm#BABBHJDH

http://gerardnico.com/wiki/dw/etl/informatica_powercenter/powercenter_installation

Monday, October 19, 2009

As I live and breathe

Whoever wrote this should have their programmer's license revoked. Oh, wait - licenses aren't required to be a programmer. Well, maybe we should start requiring them. The first question on that test should be "When is it OK to include a SELECT * in production code?" If the answer is anything other than "never", the programmer should be shot.

Actual production SQL code follows:

SELECT P.WMS_SHIP_DT, P.DC_LOCATION_ID, P.TO_LOCATION_ID, P.PRODUCT_ID, P.PO_NBR, Z.DEMAND_DT
, NVL(Z.SOURCE_VENDOR_ID, 1802) AS SOURCE_VENDOR_ID, Z.VENDOR_ID
, CASE WHEN NVL(Z.DPR_PROJ_ORDER_QTY,0) > NVL(P.WMS_ORIG_ORD_QTY,0) THEN NVL(Z.DPR_PROJ_ORDER_QTY,0)
ELSE NVL(P.WMS_ORIG_ORD_QTY,0) END AS DEMAND_QTY
, NVL(Z.DPR_PROJ_ORDER_QTY,0) AS DPR_PROJ_ORDER_QTY, NVL(P.PO_ORIG_ORDER_QTY,0) AS PO_ORIG_ORDER_QTY
, NVL(P.PO_ORDER_QTY,0) AS PO_ORDER_QTY, NVL(P.WMS_UNIT_PRICE,0) AS WMS_UNIT_PRICE
, NVL(P.WMS_ORIG_ORD_QTY,0) AS WMS_ORIG_ORD_QTY, NVL(P.WMS_SHIPPED_QTY,0) AS WMS_SHIPPED_QTY
, NVL(P.WMS_CANCEL_QTY,0) AS WMS_CANCEL_QTY, NVL(P.OUT_OF_STOCK_QTY,0) AS OUT_OF_STOCK_QTY
FROM (SELECT O.WMS_SHIP_DT, O.DC_LOCATION_ID, O.TO_LOCATION_ID, O.PO_NBR, O.PRODUCT_ID,
MAX(T.PO_DOC_DT) AS PO_DOC_DT, MAX(T.PO_ORIG_ORDER_QTY) AS PO_ORIG_ORDER_QTY,
MAX(T.PO_ORDER_QTY) AS PO_ORDER_QTY, MAX(NVL(O.WMS_UNIT_PRICE,0)) AS WMS_UNIT_PRICE,
SUM(O.WMS_ORIG_ORD_QTY) AS WMS_ORIG_ORD_QTY, SUM(O.WMS_SHIPPED_QTY) AS WMS_SHIPPED_QTY,
SUM(O.WMS_CANCEL_QTY) AS WMS_CANCEL_QTY,
CASE WHEN SUM(O.WMS_ORIG_ORD_QTY - O.WMS_SHIPPED_QTY - O.WMS_CANCEL_QTY) >= 0
THEN SUM(O.WMS_ORIG_ORD_QTY - O.WMS_SHIPPED_QTY - O.WMS_CANCEL_QTY)
ELSE 0 END AS OUT_OF_STOCK_QTY
FROM EDW_STANDBY..WMS_OUTBOUND O
LEFT OUTER JOIN (SELECT PH.PO_DOC_DT, PH.PO_NBR, PL.PRODUCT_ID, PL.LOCATION_ID
,PL.PO_ORIG_ORDER_QTY, PL.PO_ORDER_QTY
FROM EDW_STANDBY..PO_HDR_FACT PH, EDW_STANDBY..PO_LN_FACT PL
WHERE PH.PO_MSTR_DUE_DT BETWEEN CURRENT_DATE - 75 AND CURRENT_DATE + 30
AND PH.PO_TYPE_ID = 'UB'
AND PH.PO_GID = PL.PO_GID
AND PH.PO_MSTR_DUE_DT = PL.PO_MSTR_DUE_DT) T
ON O.TO_LOCATION_ID = T.LOCATION_ID AND O.PO_NBR = T.PO_NBR AND O.PRODUCT_ID = T.PRODUCT_ID
WHERE O.WMS_SHIP_DT > CURRENT_DATE - 35
AND O.PO_NBR IS NOT NULL
GROUP BY O.WMS_SHIP_DT, O.DC_LOCATION_ID, O.TO_LOCATION_ID, O.PO_NBR, O.PRODUCT_ID) P
LEFT OUTER JOIN (SELECT *
FROM WMS_DEMAND_STORE_DAY
WHERE DEMAND_DT > CURRENT_DATE - 60
AND SHIP_SEQ_NBR = 1) Z
ON NVL(P.PO_DOC_DT, TO_DATE('12/31/9999', 'MM/DD/YYYY')) = Z.DEMAND_DT
AND P.TO_LOCATION_ID = Z.TO_LOCATION_ID
AND P.DC_LOCATION_ID = Z.DC_LOCATION_ID
AND P.PRODUCT_ID = Z.PRODUCT_ID
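
For the record, the fix for the offending piece is trivial: name the columns the outer query actually uses from Z instead of SELECT *. Something like this (the column list below is inferred from the references to Z above; verify it against the real table):

LEFT OUTER JOIN (SELECT DEMAND_DT, SOURCE_VENDOR_ID, VENDOR_ID, DPR_PROJ_ORDER_QTY,
                        TO_LOCATION_ID, DC_LOCATION_ID, PRODUCT_ID
                 FROM WMS_DEMAND_STORE_DAY
                 WHERE DEMAND_DT > CURRENT_DATE - 60
                 AND SHIP_SEQ_NBR = 1) Z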