Big Data is shaking up everything from education and economics to business and the sciences: the changes may be as big as those introduced by the printing press. As promoted, its biggest impact is that we no longer need to research how to automate a task and teach a computer to do it: inferring probabilities from large amounts of data is enough.
In the past, collecting, storing and analyzing data was expensive and time-consuming: in the year 2000, digital information was just one-quarter of the world’s stored information. Now we can easily capture and store ever-growing amounts of data: today, only 1% of all stored information is non-digital, because digital data is growing exponentially.
But behind the Big Data hype, there’s also a Big Unawareness of statistical science:
- Big Data may tempt us to cheat and work backward (data -> analysis -> conclusions drawn from correlations), but correlation does not imply causation, and the traditional scientific method must not be forgotten. Otherwise, the same statistical errors are simply made on a grander scale.
- Statistical models and scientific understanding are still needed, since more data brings more spurious patterns that obscure a roughly constant number of genuine insights: without careful analysis, the signal-to-noise ratio quickly drops toward zero. The researcher’s frame of mind is as important as ever: the only answers found are the ones the researcher is looking for.
- More data doesn’t always mean more accuracy: the bigger the data set, the more likely it is to contain errors and the higher the number of false positives inferred. More data may not cancel out errors, and carefully sampled subsets may still outperform the full data set.
- Not everything can be captured, so the question of what is missing remains, and sampling bias and sampling error must still be considered. Sampling bias is more damaging than sampling error, since there is always the question of which underlying population the data has actually captured.
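Two of the points above are easy to check with a small simulation. The sketch below is only an illustration under assumptions of my own: the function names, the thresholds and the synthetic Gaussian data are all hypothetical and come from no real data set. The first function counts how many pure-noise variables appear correlated with a pure-noise target as more variables are added. The second compares a small random sample with a much larger sample that silently leaves out part of the population.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def count_spurious(n_features, n_rows=50, threshold=0.3, seed=0):
    """Count noise features whose |r| with a noise target exceeds threshold.

    Nothing here is related to anything else by construction, so every
    "hit" is a spurious pattern. The threshold of 0.3 is an arbitrary choice.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    hits = 0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        if abs(pearson(feature, target)) > threshold:
            hits += 1
    return hits


def random_vs_biased(pop_size=100_000, sample_size=500, cutoff=45, seed=1):
    """Return (true mean, small random sample mean, big biased sample mean).

    The synthetic population is Gaussian with mean 50 and deviation 10.
    The biased sample keeps every value above `cutoff`, imitating data
    that only captures part of the underlying population.
    """
    rng = random.Random(seed)
    population = [rng.gauss(50, 10) for _ in range(pop_size)]
    true_mean = statistics.fmean(population)
    small_random = rng.sample(population, sample_size)  # sampling error only
    big_biased = [x for x in population if x > cutoff]  # sampling bias
    return true_mean, statistics.fmean(small_random), statistics.fmean(big_biased)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, "noise features ->", count_spurious(n), "spurious correlations")
    true_mean, small_mean, biased_mean = random_vs_biased()
    print("true mean:", round(true_mean, 2))
    print("random sample of 500:", round(small_mean, 2))
    print("large biased sample:", round(biased_mean, 2))
```

In this toy setup the number of genuine relationships stays at zero while the count of apparent ones grows with the number of variables examined. The biased sample is far larger than the random one, yet its mean sits several units away from the true mean, while the small random sample is typically off by only a fraction of a unit: collecting more of the same biased data would not help.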
In other words, Big Data does not equal Big Insights: science, deep reasoning and proper inference are as necessary as ever, and statisticians are beginning to modify and fine-tune their toolsets. As a further remedy, I predict that tools from the field of Automated Reasoning will also be increasingly adopted to fight this data avalanche.