Monthly Archives: April 2014

Best Practices on Hadoop

A quick summary from my experiences with Hadoop:

  • Don’t lose focus on what really matters: not efficiently storing and retrieving huge amounts of data, but extracting useful insights from it.
    • The quickest way to start analyzing large amounts of data is to re-use R code from CRAN with the help of Cascading (its Pattern library runs PMML models exported from R). Mahout is a very good alternative, but not very efficient at the moment.
    • Most Hadoop deployments in the world are in experimental phases and not in production: they are proofs of concept. Many projects will fail to meet expectations because they expect too much too soon, even when basic functionality is not mature enough (real-time querying with Impala, Stinger v3 or Drill; processing of graph structures with Giraph; granular security with Accumulo).
  • Hadoop is not a replacement for the traditional data warehouse (DWH): the strength of Hadoop is that it handles massive amounts of unstructured data that DWHs weren’t designed for. It’s true that Hadoop is less expensive than a DWH, but that doesn’t mean it will take over all DWH workloads: redesign the DWH architecture to make room for Hadoop.
  • HBase and HDFS are very different: use HBase to serve content on websites and reference data, since it’s designed for key/value lookups, fast range scans and maintaining versions; use HDFS for ETL and heavy workloads, since it’s designed for batch processing and large data scans.
  • Basic architectural tips:
    • Larger clusters heal faster: the re-replication work after a node failure is spread across more nodes.
    • More racks offer more failure domains.
    • Plan for high availability of the master/NameNode: configure a standby NameNode (HDFS HA). Note that the Secondary NameNode only performs periodic checkpoints of the metadata; it is not a failover standby.
    • The raison d’être of the cloud is elasticity: reserve the floor of measured demand and spin up capacity on demand. Consider automating the removal of DataNodes when they are not in use.
    • The NameNode is a factor that could limit growth, since it keeps the entire namespace in RAM: more than 100 GB may be necessary for very large deployments (about 1 GB of metadata is typically needed per 1 PB of storage).
    • Plan for enough capacity: storage usage should never reach 80%, or the cluster will start to get slower. Spare nodes let the cluster keep running through failures.
    • Nodes must be NTP-synchronized.
    • When everything is properly set up, a single operator should be able to manage 5K nodes.
  • The performance of Hadoop tasks is I/O-bound by design: beware that public cloud servers (Azure/Amazon) are not designed for I/O-intensive tasks, just for in-memory processing. Usually the storage is separated from the CPUs by a network (NAS): this architecture hurts performance more than disk virtualization does, so, whenever possible, use local storage on storage-optimized instances.
    • On the other hand, using cloud storage (S3, ASV) also has its advantages: you can re-use the stored files from different clusters without creating a copy for each cluster; also, the availability of these cloud storage services is much higher.
  • Processing is often memory-bound rather than I/O- or CPU-bound (real-time queries are memory hungry), so take extra care to conserve precious memory while architecting and coding.
  • Only consider writing Map/Reduce code for the most frequently run Hive/Pig queries: Map/Reduce is the assembly language of Hadoop, so use it as a last resort.
  • Operations
    • Balance data periodically, in particular after growing a cluster.
    • Cold data should be archived and hot data over-replicated.
    • Set quotas for everything: they will help you stop domino failures.
  • Always back up the NameNode. Also, consider configuring several redundant directories for its metadata (for example, one of them on NFS).
  • Monitoring and performance tuning: the only way to start optimizing your code is to collect statistics while running jobs using the best available tools (Nagios, Operations Manager, …). There’s also specialized software to monitor Hadoop loads (Ambari):
    • You should monitor everything: disk I/O and SMART statistics, size and number of open files over time, network I/O, CPU, memory, RPC metrics, JVM statistics, etc… Analyze and correlate these with Hadoop statistics (HDFS, MapReduce, Hive).
    • You will discover that enabling compression, using a better task-scheduling algorithm, and increasing the number of threads, the number of parallel copies and the HDFS block size per map are common changes: every Hadoop distribution seems to set them too low. Note that larger blocks per map imply larger heap sizes, because the map outputs have to be sorted in the map’s sort buffer.
    • The number of map tasks should be less than half the number of available processor cores, and the number of reduce tasks half the number of map tasks. Avoid having too many maps, or many maps with a very short run-time.
    • The number of reduces is a decisive factor: too many reduces produce countless small files that decrease performance; on the other hand, with too few reduces, each one may have to process too big a load.
    • Correct JVM configuration is a must (and it’s not only about the maximum amount of memory per virtual machine): only use a 64-bit JVM with a low-latency garbage collector.
    • Find and analyze failed DataNodes: in the long term, it could help save a cluster in case the problem starts spreading.
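
The sizing rules of thumb in this list (about 1 GB of NameNode metadata per PB, storage kept under 80%, maps below half the cores, reduces at half the maps) can be turned into a quick back-of-the-envelope calculator. This is only a minimal sketch: the function names are my own, and the replication factor of 3 is an assumption (it is the HDFS default, but your cluster may differ).

```python
import math

# Rules of thumb taken from the list above; all names here are hypothetical.
GB_METADATA_PER_PB = 1.0   # ~1 GB of NameNode metadata per 1 PB of storage
MAX_STORAGE_FILL = 0.80    # storage usage should stay below 80%


def namenode_metadata_gb(storage_pb):
    """Estimate the NameNode metadata size (GB) for a given storage size (PB)."""
    return storage_pb * GB_METADATA_PER_PB


def raw_capacity_needed_tb(data_tb, replication=3):
    """Raw disk (TB) needed so replicated data stays under the 80% threshold.

    replication=3 is an assumption (the HDFS default), not a figure from the post.
    """
    return data_tb * replication / MAX_STORAGE_FILL


def task_counts(cores):
    """Return (maps, reduces): maps below half the cores, reduces half the maps.

    Both values are floored at 1, so very small core counts cannot
    strictly satisfy the "less than half" rule.
    """
    maps = max(1, math.ceil(cores / 2) - 1)
    reduces = max(1, maps // 2)
    return maps, reduces
```

For example, `namenode_metadata_gb(100)` gives 100 GB for a 100 PB deployment, which is where the "more than 100 GB" figure for very large clusters comes from. Treat the results as starting points for measurement, not as tuned settings.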

Assorted Links (Algorithms)

Preventing more Heartbleeds

It’s all over the news: a vulnerability has been found in OpenSSL that leaks memory contents from servers and clients. Named Heartbleed, it has a very simple patch, and some informative posts have already been written about it (Troy Hunt, Matthew Green).

What nobody is saying is that the real root cause is the lack of modern memory management in the C language: OpenSSL added a wrapper around malloc() to manage memory in a more secure and efficient way, effectively bypassing a decade of improvements in this area; specifically, it tries to improve the reuse of allocated memory by avoiding calls to free(). Now enter Heartbleed: through a very simple bug (intentional or not), the attacker is able to read up to 64 KB of process memory next to the request buffer with each heartbeat. What was the real use of that layer?
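
To make the mechanism concrete, here is a toy model of it. It is not OpenSSL code: a Python bytearray stands in for process memory, and `ToyHeap`, `heartbeat_vulnerable` and `heartbeat_checked` are hypothetical names of mine. The allocator reuses released blocks without wiping them, and the vulnerable handler trusts the length claimed by the peer instead of the real payload length.

```python
BLOCK = 64  # fixed block size of the toy allocator (arbitrary choice)


class ToyHeap:
    """Freelist allocator that reuses released blocks without clearing them."""

    def __init__(self, blocks=4):
        self.memory = bytearray(BLOCK * blocks)  # stands in for process memory
        self.next_block = 0
        self.freelist = []                       # offsets of released blocks

    def alloc(self):
        if self.freelist:
            return self.freelist.pop()           # stale contents are still there
        offset = self.next_block * BLOCK
        self.next_block += 1
        return offset

    def release(self, offset):
        self.freelist.append(offset)             # kept for reuse, never wiped


def heartbeat_vulnerable(heap, payload, claimed_len):
    """Echo claimed_len bytes back, trusting the peer's length field."""
    buf = heap.alloc()
    heap.memory[buf:buf + len(payload)] = payload
    reply = bytes(heap.memory[buf:buf + claimed_len])  # over-read happens here
    heap.release(buf)
    return reply


def heartbeat_checked(heap, payload, claimed_len):
    """Same handler with the bounds check: bad requests are silently dropped."""
    if claimed_len > len(payload):
        return None
    return heartbeat_vulnerable(heap, payload, claimed_len)


heap = ToyHeap()
old = heap.alloc()
heap.memory[old:old + 18] = b"secret-key=hunter2"  # an earlier, unrelated use
heap.release(old)

leak = heartbeat_vulnerable(heap, b"hi", 18)  # 2-byte payload, claims 18 bytes
safe = heartbeat_checked(heap, b"hi", 18)     # rejected by the length check
```

In this sketch `leak` still carries the tail of the old secret, because the recycled block was never cleared, while `safe` is `None`. The check itself is one line, which is the point of the paragraph above: the custom allocation layer did nothing to contain the damage once that check was missing.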

Face it: it’s a no-win situation. No matter how many ways these layers are going to be written, there will always be a chance for error. You can’t have secure code in C.

But re-writing and/or throwing away thousands of security-related programs written in C is not an option: the only way to run these programs securely is with the help of memory-debugging techniques, like those used by Insure++ or Rational Purify. For example, the following technical report contains a detailed analysis of some of the techniques that prevent these kinds of vulnerabilities:

Download (PDF, 1.99MB)

Assorted Links (Crypto)