A quick summary from my experiences with Hadoop:

  • Don’t lost focus on what really matters: not to efficiently store and retrieve fabulous amount of data, but to extract useful insights from it.
    • The quickest way to start analyzing big amounts of data is by re-using R code from CRAN with the help of Cascading, a tool that generates PMML models. Mahout is a very good alternative, but not very efficient at the moment.
    • Most Hadoop deployments in the world are in experimental phases and not in production: they are proof of concepts. Many projects will fail to meet expectations because they expect too much too soon, even when basic functionality is not mature enough (real-time querying –Impala, Stinger v3, Drill-, processing of graph structures -Giraph-, granular security -Accumulo-).
  • Hadoop is not a replacement for the traditional DataWarehouse: the strength of Hadoop is that it handles massive amounts of unstructured data that DWHs weren’t designed for. It’s true that Hadoop is less expensive than any DWH, but that doesn’t mean that it will replace all their workloads: redesign the DW architecture to make a place for Hadoop.
  • HBase and HDFS are very different: use HBase to serve content on websites and reference data, since it’s designed for key/value lookups, fast range scans and maintaining versions; use HDFS for ETL and heavy workloads, since it’s designed for batch processing and large data scans.
  • Basic architectural tips:
    • Faster healing time for larger clusters.
    • More racks offer more failure domains.
    • Plan for high-availability for the master/name node: configure a secondary NameNode for a HA standby architecture with periodical updating.
    • The raison d’être of the cloud is elasticity: reserve the floor of measured demand and spin up capacity on-demand. Consider the automation of the removal of datanodes when not in use.
    • The namespace node is factor that could limit growth since it keeps the entire namespace in RAM: more than 100 GB may be necessary for very large deployments (1 GB metadata is typically used for to 1 PB of storage).
    • Plan for enough capacity: storage should never reach 80%, or the cluster will start to get slower. Spare nodes enable the cluster to run on failures.
    • Nodes must be NTP-synchronized.
    • When everything is properly setup, an operator should manage 5K nodes.
  • The performance of Hadoop tasks is I/O-bound by design: beware that public cloud servers (Azure/Amazon) are not designed for I/O intensive tasks, just for in-memory processing. Usually, the storage is separated from the CPUs by a network (NAS): this architecture impacts the performance more than disk virtualization so, whenever possible, try to use local storage with storage optimized instances.
    • On the other hand, using cloud storage (S3, AVS) also has its advantages: you will be able to re-use the stored files for different clusters without needing to create a copy for each cluster; also, the availability of these cloud storages is much higher.
  • Many times, processing is memory-bound and not IO/CPU-bound (real-time queries are memory hungry) so take extra care to conserve precious memory while architecting and coding.
  • Only consider to write Map/Reduce code for the most common of Hive/Pig queries: Map/Reduce is the assembly of Hadoop, use it as last recourse.
  • Operations
    • Balance data periodically, in particular after growing a cluster.
    • Cold data should be archived and hot-data over-replicated.
    • Set quotas for everything: they will help you to stop domino failures.
  • Always backup the namenode. Also, consider to mount several redundant directories for its metadata (NFS).
  • Monitoring and performance tuning: the only way to start optimizing your code is to collect statistics while running jobs using the best available tools (Nagios, Operations Manager, …). There’s also specialized software to monitor Hadoop loads (Ambari):
    • You should monitor everything: disk I/O and SMART statistics, size and number of open files over time, network I/O, CPU, memory, RPC metrics, JVM statistics, etc… Analyze and correlate these with Hadoop statistics (HDFS, MapReduce, Hive).
    • You will discover that enabling compression, using a better algorithm for task scheduling, incrementing the number of threads, parallel copies and the size of the HDFS blocksize/map are common changes: every Hadoop distribution seems to keep them too low. Note that larger blocks per map imply larger heap-sizes for the map-outputs to be sort in the map’s sort-buffer.
    • The number of map tasks should be less than half the number of available processor cores, and the number of reduce tasks half the number of map tasks. Avoid having too many maps or many maps with a very short run-time.
    • The number of reduces is a decisive factor: too many reduces produce countless small files that decrease performance; on the other hand, if there are a very little number of reduces, each may have too process too big loads per reduce.
    • Correct JVM configuration is a must (and it’s not only about the maximum amount of memory per virtual machine): only use a 64bit JVM with low-latency garbage collector.
    • Find and analyze failed datanodes: long term, it could help save a cluster in case the problem starts replicating.

Big Data is shaking up everything, from education, economics, businesses and the sciences: the changes may be as big as the ones introduced by the printing press. As promoted, its biggest impact is that now we don’t need to research how to automate and teach a computer to do things: just inferring probabilities from big amounts of data is enough.

In the past, data collection, storing and analyzing methods were expensive and time consuming: in the year 2000, digital information was just one-quarter of the world’s stored information. Now we can easily capture and store ever-growing amounts of data: today, only 1% of all the stored information is non-digital, since the digital data is growing exponentially.

But behind the Big Data hype, there’s also Big Unawareness of statistical sciences:

  • Big data may allow to cheat and work backward (data->analysis->conclusions from correlations), but correlation does not imply causation and the traditional scientific method is not to be forgotten. The same statistical error may be made on a grander scale.
  • Statistical models and scientific understanding are yet needed, since more data brings more spurious patterns that obscure a constant number of the genuine insights: the signal to noise ratio quickly drops to zero without careful analysis. The mind frame of the researcher is as important as always: the only answers to be found are the ones that the researcher is looking for.
  • More data doesn’t always mean more accuracy: the bigger the data set, the more likely it is to have errors and the higher the number of false positives inferred. More data may not cancel out errors and carefully sampled subsets may still outperform.
  • Not everything can be captured, the question about what is missing is still there and sampling bias and error must still be considered: sampling bias is more impactful that sampling error, since there always the question of what underlying population has been captured by the data.

In the other words, Big Data does not equal Big Insights: science, deep reasoning and proper inferencing are as necessary as ever, and statisticians are beginning to modify and fine-tune their toolsets: as a remedy, I predict that tools from the Automated Reasoning field will also be increasingly adopted to fight this data avalanche.

Set your Twitter account name in your settings to use the TwitterBar Section.