Big Data is shaking up everything, from education, economics, businesses and the sciences: the changes may be as big as the ones introduced by the printing press. As promoted, its biggest impact is that now we don’t need to research how to automate and teach a computer to do things: just inferring probabilities from big amounts of data is enough.
In the past, data collection, storing and analyzing methods were expensive and time consuming: in the year 2000, digital information was just one-quarter of the world’s stored information. Now we can easily capture and store ever-growing amounts of data: today, only 1% of all the stored information is non-digital, since the digital data is growing exponentially.
But behind the Big Data hype, there’s also Big Unawareness of statistical sciences:
- Big data may allow to cheat and work backward (data->analysis->conclusions from correlations), but correlation does not imply causation and the traditional scientific method is not to be forgotten. The same statistical error may be made on a grander scale.
- Statistical models and scientific understanding are yet needed, since more data brings more spurious patterns that obscure a constant number of the genuine insights: the signal to noise ratio quickly drops to zero without careful analysis. The mind frame of the researcher is as important as always: the only answers to be found are the ones that the researcher is looking for.
- More data doesn’t always mean more accuracy: the bigger the data set, the more likely it is to have errors and the higher the number of false positives inferred. More data may not cancel out errors and carefully sampled subsets may still outperform.
- Not everything can be captured, the question about what is missing is still there and sampling bias and error must still be considered: sampling bias is more impactful that sampling error, since there always the question of what underlying population has been captured by the data.
In the other words, Big Data does not equal Big Insights: science, deep reasoning and proper inferencing are as necessary as ever, and statisticians are beginning to modify and fine-tune their toolsets: as a remedy, I predict that tools from the Automated Reasoning field will also be increasingly adopted to fight this data avalanche.
A graphical summary to Caspers Jones’ latest book, “The Technical and Social History of Software Engineering”, aggregating the data of thousands of projects:
- Note how application size is lowering in terms of number of lines of code, in direct correlation to the linear increase in the expressive power of programming languages. This observation fits well the growing number of web/mobile application that only do a very limited number of functions.
- The maximum percentage of code reuse is growing very fast, due to a higher number of libraries and open-source, but spotting projects with a 85% of reuse is a yet a rarity.
- Defect removal efficiency has steadily improved, but I expected a steeper line due to static analysis and better compiler warnings
- The percentage of personal dedicated to maintenance has surpassed that of the initial development, but there’s little research on the success factors of this stage.
As languages improved (and their number, so more languages are available for specific tasks), so did the programmer’s productivity, lowering the defect potential at the same time: this document about software engineering laws also provides another interesting outlook of the same datasets.
Imagine devising a set of rules for a game such that the dominant strategy of every player is to truthfully reveal their valuations and/or strategies: this is just one of the ambitious goals of mechanism design, the science of rule-making and the most useful branch of game theory. Fifteen years ago, a pioneering paper of Nisam and Ronen (Algorithmic Mechanism Design) merged it with computer science by including the requisite that computations should also be reasonably tractable for every involved player: this created a fruitful field of research that contributed every tool of algorithmics and computational complexity, from combinatorial optimization and linear programming to approximation algorithms and complexity classes.
In practice, Algorithmic Mechanism Design is also behind the successes of the modern Internet economy: every ad-auctions uses it results, like Google’s DoubleClick auctions or Yahoo’s Auctions, and peer-to-peer networks and network protocols are being designed under its guiding principles. It has also contributed to spectrum auctions and matching markets (kidneys, school choice systems and medical positions) and it has also generated interesting models, like the first one that justifies the optimality of the fee-setting structure of real estate agents, stock brokers and auction houses (see Fee Setting Intermediaries).
Up until a decade ago, the only way to learn this fascinating field of research was by venturing to read papers dispersed between the areas of economics, game theory and computer science, but this changed in 2008 with the publication of the basic textbook of the field, Algorithmic Game Theory, also available online:
Now a bit dated, it has recently been complemented with some great resources:
- The Handbook of Market Design, of which the part I have liked the most is the one on experiments.
- The online courses of Tim Roughgarden, a real master on choosing and presenting the best proofs of this field: Algorithmic Game Theory and Frontiers in Mechanism Design.
- The on-going writing of a more specific book, Mechanism Design and Approximation
And that’s enough to begin with: hundred of hours of learning insightful research with fantastic applications!
- February 2017
- April 2014
- March 2014
- December 2013
- November 2013
- July 2013
- April 2013
- March 2013
- February 2013
- January 2013
- December 2012
- November 2012
- October 2012
- September 2012
- August 2012
- July 2012
- June 2012
- May 2012
- March 2012
- February 2012
- January 2012
- December 2011
- November 2011
- October 2011
- August 2011
- July 2011
- June 2011
- May 2011
- April 2011
- March 2011
- February 2011