A quick summary from my experiences with Hadoop:

  • Don’t lost focus on what really matters: not to efficiently store and retrieve fabulous amount of data, but to extract useful insights from it.
    • The quickest way to start analyzing big amounts of data is by re-using R code from CRAN with the help of Cascading, a tool that generates PMML models. Mahout is a very good alternative, but not very efficient at the moment.
    • Most Hadoop deployments in the world are in experimental phases and not in production: they are proof of concepts. Many projects will fail to meet expectations because they expect too much too soon, even when basic functionality is not mature enough (real-time querying –Impala, Stinger v3, Drill-, processing of graph structures -Giraph-, granular security -Accumulo-).
  • Hadoop is not a replacement for the traditional DataWarehouse: the strength of Hadoop is that it handles massive amounts of unstructured data that DWHs weren’t designed for. It’s true that Hadoop is less expensive than any DWH, but that doesn’t mean that it will replace all their workloads: redesign the DW architecture to make a place for Hadoop.
  • HBase and HDFS are very different: use HBase to serve content on websites and reference data, since it’s designed for key/value lookups, fast range scans and maintaining versions; use HDFS for ETL and heavy workloads, since it’s designed for batch processing and large data scans.
  • Basic architectural tips:
    • Faster healing time for larger clusters.
    • More racks offer more failure domains.
    • Plan for high-availability for the master/name node: configure a secondary NameNode for a HA standby architecture with periodical updating.
    • The raison d’être of the cloud is elasticity: reserve the floor of measured demand and spin up capacity on-demand. Consider the automation of the removal of datanodes when not in use.
    • The namespace node is factor that could limit growth since it keeps the entire namespace in RAM: more than 100 GB may be necessary for very large deployments (1 GB metadata is typically used for to 1 PB of storage).
    • Plan for enough capacity: storage should never reach 80%, or the cluster will start to get slower. Spare nodes enable the cluster to run on failures.
    • Nodes must be NTP-synchronized.
    • When everything is properly setup, an operator should manage 5K nodes.
  • The performance of Hadoop tasks is I/O-bound by design: beware that public cloud servers (Azure/Amazon) are not designed for I/O intensive tasks, just for in-memory processing. Usually, the storage is separated from the CPUs by a network (NAS): this architecture impacts the performance more than disk virtualization so, whenever possible, try to use local storage with storage optimized instances.
    • On the other hand, using cloud storage (S3, AVS) also has its advantages: you will be able to re-use the stored files for different clusters without needing to create a copy for each cluster; also, the availability of these cloud storages is much higher.
  • Many times, processing is memory-bound and not IO/CPU-bound (real-time queries are memory hungry) so take extra care to conserve precious memory while architecting and coding.
  • Only consider to write Map/Reduce code for the most common of Hive/Pig queries: Map/Reduce is the assembly of Hadoop, use it as last recourse.
  • Operations
    • Balance data periodically, in particular after growing a cluster.
    • Cold data should be archived and hot-data over-replicated.
    • Set quotas for everything: they will help you to stop domino failures.
  • Always backup the namenode. Also, consider to mount several redundant directories for its metadata (NFS).
  • Monitoring and performance tuning: the only way to start optimizing your code is to collect statistics while running jobs using the best available tools (Nagios, Operations Manager, …). There’s also specialized software to monitor Hadoop loads (Ambari):
    • You should monitor everything: disk I/O and SMART statistics, size and number of open files over time, network I/O, CPU, memory, RPC metrics, JVM statistics, etc… Analyze and correlate these with Hadoop statistics (HDFS, MapReduce, Hive).
    • You will discover that enabling compression, using a better algorithm for task scheduling, incrementing the number of threads, parallel copies and the size of the HDFS blocksize/map are common changes: every Hadoop distribution seems to keep them too low. Note that larger blocks per map imply larger heap-sizes for the map-outputs to be sort in the map’s sort-buffer.
    • The number of map tasks should be less than half the number of available processor cores, and the number of reduce tasks half the number of map tasks. Avoid having too many maps or many maps with a very short run-time.
    • The number of reduces is a decisive factor: too many reduces produce countless small files that decrease performance; on the other hand, if there are a very little number of reduces, each may have too process too big loads per reduce.
    • Correct JVM configuration is a must (and it’s not only about the maximum amount of memory per virtual machine): only use a 64bit JVM with low-latency garbage collector.
    • Find and analyze failed datanodes: long term, it could help save a cluster in case the problem starts replicating.

Apache Hadoop, a rework by Yahoo of the Google File System and MapReduce, has become the lingua franca of the Big Data movement. By the trail of Google’s success, the MapReduce paradigm manages to reach a successful balance between a reasonable developer learning curve, scalability and fault-tolerance for storing and querying very large datasets.

But the true history and community efforts behind Hadoop is much more complex: Google left behind the constraints of GFS and MapReduce long time ago to more efficient and capable technologies. And so did other open-source projects: Hive and Pig to carry BI/analytics queries in a low-latency fashion, matching Google’s BigQuery; HBase and Storm for real-time search and incremental indexing, substituting the Percolator engine at Google and the BigTable store; and Giraph, for carrying out large-scale graph-processing computations with significant speedups, like the Pregel framework at Google.

So, when talking about Hadoop, you have to distinguish between the core Hadoop stack (HDFS, ZooKeeper) and the growing number of projects surrounding this core. Many companies are creating Hadoop distributions (Cloudera, HortonWorks, MapR), capitalizing on the growing need for commercial support by integrating a subset of those projects in an easy-to-install package. Except that they will never include anything innovative nor experimental, the main reason behind their existence just being support and not cutting-edge research and development: groundbreaking developments will always happen outside them, since support costs always grow higher with new lines of code than their initial development cost.

Here, the parallel with Linux ecosystem and distributions is clear, since fragmentation between distributions will severely break interoperability in the very same way: currently, no one has proposed an equivalent for the Hadoop ecosystem to the Linux Foundation, the entity that proposes the Linux Standard Base and the Filesystem Hierarchy Standard. But then again, the incentive between the Hadoop distributions is to create lock-in, which should increase fragmentation, not the other way around.

Hadoop on Azure is just another distribution, and the comparative advantage of using this or any other distribution is null if they don’t include the full list of revolutionary projects that extend the core Hadoop stack to enable the most innovative ventures. It’s not about the Azure’s price of the underlying storage or computation. The key thing is to develop and openly provide all the wrapper and glue-code that now is individually created by every user of these projects: that will be a real challenge to Microsoft Open Technologies.

Set your Twitter account name in your settings to use the TwitterBar Section.