Lies, statistics and the truth about data analysis

Reports of made-up census data and fears of high opt-out rates from the electoral roll should not compromise the integrity of data analysis, argues Paul Gage

Those whose job involves the use of statistical data may have been concerned at two articles in the national press last week. One, about the 2001 UK census, revealed that “civil servants have made up personal details for at least 1 million people” to compensate for gaps in the data. The second concerned the electoral roll and the new “edited register”. Consumers now have the opportunity to opt out of the edited register, the only version of the database that will be available to commercial organisations.

Two things immediately struck me about these articles. First, if the Office for National Statistics (ONS), stalwart of procedural correctness and epitome of trust and reliability, is “imputing” information about individuals, whom, if anyone, can we trust to provide accurate data? At least two per cent of the ONS census data is constructed with a technique described as “experimental”, which has many regulatory and financial implications. What does this say about the rest of its data?

The second point these articles brought home to me was the importance of this data to the Government and direct marketers.

The Government, according to the piece in The Guardian, uses the official regional headcounts to decide on funding levels for schools, GPs and other local services. While it’s fair to say the techniques employed by the ONS will probably give a good representation of reality, the fact remains that fundamental data, on which correct national administration depends, is incomplete.

Direct marketers, and especially companies that depend on the electoral roll for their segmentation products, such as Claritas and Experian, are in a similar position to the Government in that data which is so fundamental to them is subject to significant limitations. The implications are potentially worse in marketing, given that the opt-out rate is likely to be well over the two per cent level of missing data that concerns the Government. This will doubtless have an impact on the robustness of their models, and the level of confidence that can be drawn from them.

It seems, therefore, that even the most reliable sources turn out to be not completely dependable. The most important data can look deceptively complete. So what data, if any, can be trusted fully and unconditionally?

Well, perhaps this is missing the point. All data-driven marketing purports to be rooted exclusively in fact, but this is not actually the case. Like a model, data is a representation of the world – not the real thing. I would argue, therefore, that no data can ever be trusted to be fully and unconditionally correct. Rather, trust is best placed in those who use and interpret the data, who are fully aware of all the surrounding implications, and who can draw conclusions that are both balanced and intelligent and, importantly, reached through analysis that is as honest and transparent as possible.

Paul Gage is a data analyst at Carat