{"id":1057,"date":"2012-07-14T20:20:13","date_gmt":"2012-07-14T18:20:13","guid":{"rendered":"http:\/\/cerezo.name\/blog\/?p=1057"},"modified":"2024-10-14T13:58:00","modified_gmt":"2024-10-14T11:58:00","slug":"hadoopazure-a-recipe-to-win-the-big-data-race","status":"publish","type":"post","link":"http:\/\/cerezo.name\/blog\/2012\/07\/14\/hadoopazure-a-recipe-to-win-the-big-data-race\/","title":{"rendered":"Hadoop@Azure: a Recipe to Win the Big Data&nbsp;Race"},"content":{"rendered":"<p style=\"text-align: justify;\"><a href=\"http:\/\/hadoop.apache.org\/\" target=\"_blank\" rel=\"noopener\">Apache Hadoop<\/a>, a rework by Yahoo of the <a href=\"http:\/\/research.google.com\/archive\/gfs.html\" target=\"_blank\" rel=\"noopener\">Google File System<\/a> and <a href=\"https:\/\/hadoop.apache.org\/docs\/r1.2.1\/mapred_tutorial.html\" target=\"_blank\" rel=\"noopener\">MapReduce<\/a>, has become the <em>lingua franca<\/em> of the Big Data movement. By the trail of Google\u2019s success, the MapReduce paradigm manages to reach a successful balance between a reasonable developer learning curve, scalability and fault-tolerance for storing and querying very large datasets.<\/p>\n<p style=\"text-align: justify;\">But the true history and community efforts behind Hadoop is much more complex: Google left behind the constraints of <span class=\"caps\">GFS<\/span> and MapReduce long time ago to more efficient and capable technologies. 
And so did other open-source projects: <a href=\"http:\/\/hive.apache.org\/\" target=\"_blank\" rel=\"noopener\">Hive<\/a> and <a href=\"http:\/\/pig.apache.org\/\" target=\"_blank\" rel=\"noopener\">Pig<\/a> to run <span class=\"caps\">BI<\/span>\/analytics queries with low latency, matching Google\u2019s <a href=\"https:\/\/developers.google.com\/bigquery\/\" target=\"_blank\" rel=\"noopener\">BigQuery<\/a>; <a href=\"http:\/\/hbase.apache.org\/\" target=\"_blank\" rel=\"noopener\">HBase<\/a> and <a href=\"http:\/\/storm-project.net\/\" target=\"_blank\" rel=\"noopener\">Storm<\/a> for real-time search and incremental indexing, substituting for the <a href=\"http:\/\/research.google.com\/pubs\/pub36726.html\" target=\"_blank\" rel=\"noopener\">Percolator<\/a> engine and the <a href=\"http:\/\/research.google.com\/archive\/bigtable.html\" target=\"_blank\" rel=\"noopener\">BigTable<\/a> store at Google; and <a href=\"http:\/\/giraph.apache.org\/\" target=\"_blank\" rel=\"noopener\">Giraph<\/a>, for carrying out large-scale graph-processing computations with significant speedups, like the <a href=\"http:\/\/googleresearch.blogspot.com.es\/2009\/06\/large-scale-graph-computing-at-google.html\" target=\"_blank\" rel=\"noopener\">Pregel<\/a> framework at Google.<\/p>\n<p style=\"text-align: justify;\">So, when talking about Hadoop, you have to distinguish between the core Hadoop stack (<a href=\"http:\/\/hortonworks.com\/hadoop\/hdfs\/\" target=\"_blank\" rel=\"noopener\"><span class=\"caps\">HDFS<\/span><\/a>, <a href=\"http:\/\/zookeeper.apache.org\/\" target=\"_blank\" rel=\"noopener\">ZooKeeper<\/a>) and the growing number of projects surrounding that core. 
Many companies are creating Hadoop distributions (<a href=\"http:\/\/www.cloudera.com\/content\/cloudera\/en\/why-cloudera\/hadoop-and-big-data.html\" target=\"_blank\" rel=\"noopener\" class=\"broken_link\">Cloudera<\/a>, <a href=\"http:\/\/hortonworks.com\/products\/hortonworksdataplatform\/\" target=\"_blank\" rel=\"noopener\">HortonWorks<\/a>, <a href=\"http:\/\/www.mapr.com\/products\/mapr-editions\/m5-edition\" target=\"_blank\" rel=\"noopener\" class=\"broken_link\">MapR<\/a>), capitalizing on the growing need for commercial support by integrating a subset of those projects into an easy-to-install package. The catch is that they will never include anything innovative or experimental: the main reason behind their existence is support, not cutting-edge research and development. Groundbreaking developments will always happen outside them, since the cost of supporting new lines of code always grows higher than their initial development cost.<\/p>\n<p style=\"text-align: justify;\">Here, the parallel with the Linux ecosystem and its distributions is clear, since fragmentation between distributions will break interoperability in the very same way: no one has yet proposed, for the Hadoop ecosystem, an equivalent of the <a href=\"http:\/\/www.linuxfoundation.org\/\" target=\"_blank\" rel=\"noopener\">Linux Foundation<\/a>, the entity behind the <a href=\"http:\/\/www.linuxfoundation.org\/collaborate\/workgroups\/lsb\" target=\"_blank\" rel=\"noopener\" class=\"broken_link\">Linux Standard Base<\/a> and the <a href=\"http:\/\/www.pathname.com\/fhs\/\" target=\"_blank\" rel=\"noopener\">Filesystem Hierarchy Standard<\/a>. 
But then again, the incentive for the Hadoop distributions is to create lock-in, which will only increase fragmentation, not reduce it.<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/www.hadooponazure.com\/\" target=\"_blank\" rel=\"noopener\">Hadoop on Azure<\/a> is just another distribution, and the comparative advantage of using this or any other distribution is nil if it doesn\u2019t include the full range of revolutionary projects that extend the core Hadoop stack to enable the most innovative ventures. It\u2019s not about Azure\u2019s pricing of the underlying storage or computation. The key is to develop and openly provide all the wrapper and glue code that is now written individually by every user of these projects: that will be the real challenge for <a class=\"wikinvest-suggestion-link broken_link\" href=\"http:\/\/www.wikinvest.com\/stock\/Microsoft_(MSFT)\" target=\"_blank\" rel=\"noopener\">Microsoft<\/a><a href=\"http:\/\/www.microsoft.com\/en-us\/openness\/\" target=\"_blank\" rel=\"noopener\"> Open Technologies<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Apache Hadoop, a rework by Yahoo of the Google File System and MapReduce, has become the lingua franca of the Big Data movement. In the wake of Google\u2019s success, the MapReduce paradigm strikes a balance between a reasonable developer learning curve, scalability, and fault tolerance for storing and querying very large datasets. 
But&nbsp;[\u2026]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"wp_typography_post_enhancements_disabled":false,"ngg_post_thumbnail":0},"categories":[20],"tags":[],"_links":{"self":[{"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/posts\/1057"}],"collection":[{"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/comments?post=1057"}],"version-history":[{"count":6,"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/posts\/1057\/revisions"}],"predecessor-version":[{"id":1592,"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/posts\/1057\/revisions\/1592"}],"wp:attachment":[{"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/media?parent=1057"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/categories?post=1057"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/tags?post=1057"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}