Back in 2002, the members of the Convention on Biological Diversity made this rationalisation of species’ names their number one priority. It has taken three years to trim the one million listings down to 400,000.
For researchers looking into plants that could be used for food or medicines, the existing database was less than helpful. One example was a relative of the basil plant, Plectranthus. Searching for it using only the most commonly used name would miss 80 per cent of the information about that plant family.
Sound familiar? Data redundancy affects every database, whatever it is used for. In the marketing world, there is a near obsession with retaining every record and variable that has ever been collected, regardless of how useful or complete it is. Data deletion policies are almost never written, let alone enacted.
Yet a proper spring clean can yield significant benefits. For one thing, it makes marketing plans more accurate. If you assume you have one million customers when the reality is half that number, your targets become nearly impossible to hit. You are also wasting a lot of resource going after the same people twice, and those people may become less likely to purchase because of the duplication.
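The arithmetic behind that point can be put into a short Python sketch. Everything in it is an assumption made for illustration: the sample records, the choice of a normalised email address as the matching key, and the sales target of 20,000 are all hypothetical, and real deduplication usually needs fuzzier matching than this.

```python
# Hypothetical customer records, invented purely for illustration.
records = [
    {"name": "Jane Smith", "email": "jane.smith@example.com"},
    {"name": "J. Smith", "email": "Jane.Smith@Example.com "},
    {"name": "Raj Patel", "email": "raj@example.com"},
    {"name": "Raj Patel", "email": "RAJ@example.com"},
]


def match_key(record):
    """Normalise the email so trivially different spellings collapse together."""
    return record["email"].strip().lower()


def deduplicate(records):
    """Keep the first record seen for each match key."""
    unique = {}
    for record in records:
        unique.setdefault(match_key(record), record)
    return list(unique.values())


def required_response_rate(target_sales, customer_count):
    """Share of customers who must buy for the plan to hit its target."""
    return target_sales / customer_count


unique_customers = deduplicate(records)
print(len(records), "records,", len(unique_customers), "real customers")

# A plan built on one million customers when only half that number exist.
# The target of 20,000 sales is an assumed figure.
planned = required_response_rate(20_000, 1_000_000)
actual = required_response_rate(20_000, 500_000)
print(f"planned {planned:.1%}, actually required {actual:.1%}")
```

On these assumed figures, a response rate that looks like a modest two per cent on paper turns out to be four per cent of the customers who really exist, which is why the target is so much harder to hit than the plan suggests.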
Regulatory strictures about maintaining files which cannot be easily matched are one of the reasons for this hesitancy about deduplication. Since so many database managers, and much of the technology they use, originated within financial services companies, that culture has become the standard.
It takes a very bold company to say that it is going to nett down its customer base. City analysts might not take too kindly to it and could mark the share price down. That fear alone is enough to stop many data practitioners from doing what they know they should.
But as the shake-out of over-inflated asset prices comes to its tail end, the customer database may prove to be one place where an exaggerated picture of the business is still being painted. Just as botanists had to accept that six out of ten plant names described species already covered by the other four, data practitioners need to check that their data is firmly rooted in reality.