Apache Hadoop, Yahoo's rework of Google's File System and MapReduce papers, has become the lingua franca of the Big Data movement. Following the trail of Google's success, the MapReduce paradigm strikes a successful balance between a reasonable developer learning curve, scalability, and fault tolerance for storing and querying very large datasets.
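To ground that claim about the learning curve, here is the canonical word-count job, a minimal, self-contained sketch written against the modern org.apache.hadoop.mapreduce Java API (input and output paths are placeholders passed on the command line): the developer supplies only a map and a reduce function, while the framework takes care of partitioning, shuffling, and re-running failed tasks across the cluster.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word; the framework has already
  // grouped and sorted the (word, 1) pairs by key during the shuffle.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this is typically submitted to the cluster with hadoop jar wordcount.jar WordCount followed by the input and output paths.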
But the true history and the community efforts behind Hadoop are much more complex: Google left the constraints of GFS and MapReduce behind a long time ago in favor of more efficient and capable technologies. And so did other open-source projects: Hive and Pig carry out BI/analytics queries in a low-latency fashion, matching Google's BigQuery; HBase and Storm handle real-time search and incremental indexing, standing in for Google's Percolator engine and the BigTable store; and Giraph carries out large-scale graph-processing computations with significant speedups, like Google's Pregel framework.
So, when talking about Hadoop, you have to distinguish between the core Hadoop stack (HDFS, MapReduce, ZooKeeper) and the growing number of projects surrounding that core. Many companies are creating Hadoop distributions (Cloudera, Hortonworks, MapR), capitalizing on the growing need for commercial support by integrating a subset of those projects into an easy-to-install package. The catch is that these distributions will never include anything innovative or experimental, because the main reason behind their existence is support, not cutting-edge research and development: groundbreaking developments will always happen outside them, since the cost of supporting new lines of code grows faster than the cost of developing them in the first place.
Here, the parallel with the Linux ecosystem and its distributions is clear, since fragmentation between distributions will severely break interoperability in the very same way: so far, no one has proposed a Hadoop-ecosystem equivalent of the Linux Foundation, the entity behind the Linux Standard Base and the Filesystem Hierarchy Standard. Then again, the incentive for Hadoop distributions is to create lock-in, which should increase fragmentation, not reduce it.
Hadoop on Azure is just another distribution, and the comparative advantage of using it, or any other distribution, is nil if it doesn't include the full list of revolutionary projects that extend the core Hadoop stack and enable the most innovative ventures. It's not about Azure's pricing for the underlying storage or computation. The key is to develop and openly provide all the wrapper and glue code that every user of these projects currently has to write on their own: that will be the real challenge for Microsoft Open Technologies.
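As an illustration of the kind of glue code in question, here is a hypothetical sketch (the class name and paths are invented for the example) that stages a local file into HDFS through the org.apache.hadoop.fs.FileSystem API before a job run; every team using the stack ends up writing some variant of this by hand.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative glue code: copy a local file into HDFS before submitting a job.
public class StageInput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);     // the cluster's default filesystem (HDFS)

    Path local = new Path(args[0]);  // e.g. a local log file (hypothetical)
    Path remote = new Path(args[1]); // e.g. an HDFS input directory (hypothetical)

    // delSrc = false keeps the local copy; overwrite = true replaces stale data.
    fs.copyFromLocalFile(false, true, local, remote);
    System.out.println("Staged " + local + " -> " + remote);
  }
}
```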