We've had a couple of interesting discussions this week regarding data quality. One of the major areas that we've been dealing with lately has to do with email addresses. We had a problem last Friday pulling a list of email addresses for a marketing campaign: a couple of the records were obviously bogus (something like notvalidemail@acme-hackme.com), and as an added bonus, whoever created them on the website had embedded a carriage return in the address. So not only was the email bogus, but it ended up crashing my job because the extra line break threw the line count off.
I suggested to the web team that they filter this kind of stuff out. It turns out they do, but this particular record was loaded as part of a mass import of data from the third-party vendor that used to manage that stuff.
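A mass import can run the same kind of check the website does. Here is a minimal sketch in Python, assuming the import arrives as a list of raw strings; the names `EMAIL_SHAPE`, `has_line_break` and `split_import` are made up for illustration, and the pattern is a deliberately loose shape check, not a full address validator.

```python
import re

# Loose shape check (an assumption, not a full RFC validator):
# exactly one "@", no whitespace, and a dot somewhere in the domain.
EMAIL_SHAPE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def has_line_break(value):
    """True if the raw value contains a carriage return or line feed."""
    return "\r" in value or "\n" in value


def split_import(values):
    """Split raw imported values into (clean, rejected) lists.

    Values with an embedded line break are rejected outright, since they
    would throw off the line count of any line-oriented export.
    """
    clean, rejected = [], []
    for raw in values:
        candidate = raw.strip(" \t")  # trim spaces and tabs only
        if has_line_break(raw) or not EMAIL_SHAPE.fullmatch(candidate):
            rejected.append(raw)
        else:
            clean.append(candidate)
    return clean, rejected
```

Rejected values are kept rather than dropped, so someone can review them instead of having them silently disappear from the list.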
In investigating these records, it occurred to me that there could be a whole level of data quality "corrections" that could be made to the data that would make it better. For example, not only do we want to exclude the bad ones, but we also want to encourage users to correct the ones that are almost right - because if you're specifically signing up for an email alert, then it doesn't do you any good if the email address isn't valid.
Some of the examples I'd discovered in our database were things like joe2yahoo.com or joe@yahoo.co, - where they hit the comma instead of the m - or even an ending of cp, where their hand was off just slightly (cim and cin were also present). The joe2yahoo example resulted when someone didn't hold the shift key down, and got a 2 instead of an @.
I suggested that these patterns could be matched on, and then replaced, fixing the email address, but the web developers were all very hesitant. "What if joe@yahoo.com isn't even their real email address?" they asked. "You'd be turning an intentionally bogus email address into a valid one, and then we'd be spamming poor Joe who never even asked for our emails."
In fairness, I suspect if joe@yahoo.com is a valid email address, he probably gets a ton of spam anyway.
But their argument had some merit. So yes, it would seem that data cleansing can sometimes have unforeseen side effects.
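One way to keep the pattern matching without the side effect is to have the matcher only suggest a fix and leave the stored address alone until the user confirms it. This is a rough sketch in Python, not something we actually run. The two patterns cover just the typos mentioned above, and the names `TLD_TYPO`, `MISSING_AT` and `suggest_correction` are hypothetical, as is limiting the missing-@ guess to yahoo.com.

```python
import re

# Mistyped ".com" endings seen in the data: "co,", "cp", "cim", "cin".
TLD_TYPO = re.compile(r"\.(?:co,|cp|cim|cin)\Z", re.IGNORECASE)

# A "2" where the "@" should be (shift key not held down). Only guessed
# for a known domain; the single-domain list is an assumption.
MISSING_AT = re.compile(r"([^@\s]+)2(yahoo\.com)", re.IGNORECASE)


def suggest_correction(address):
    """Return a suggested address, or None if no known typo matches.

    The caller decides what to do with the suggestion; nothing is
    rewritten here.
    """
    addr = address.strip()
    if TLD_TYPO.search(addr):
        return TLD_TYPO.sub(".com", addr)
    match = MISSING_AT.fullmatch(addr)
    if match:
        return f"{match.group(1)}@{match.group(2)}"
    return None
```

The suggestion could then drive a "did you mean joe@yahoo.com?" prompt the next time the user logs in, which leaves an intentionally bogus address bogus unless its owner says otherwise.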
Friday, May 29, 2009
Hi Curtis
There is a free online Informatica quiz available at
http://tinyurl.com/infaquiz
I will appreciate if you can take the quiz and check your Informatica IQ and provide any suggestions.
By the way, I am a regular reader of your blog and I really like your posts.
Continue the good work.
Divya
www.bidw.co.in
Thanks, Divya. I'll check it out.