{"id":1469,"date":"2014-04-11T02:22:28","date_gmt":"2014-04-11T00:22:28","guid":{"rendered":"http:\/\/cerezo.name\/blog\/?p=1469"},"modified":"2024-10-14T14:47:27","modified_gmt":"2024-10-14T12:47:27","slug":"best-practices-on-hadoop","status":"publish","type":"post","link":"http:\/\/cerezo.name\/blog\/2014\/04\/11\/best-practices-on-hadoop\/","title":{"rendered":"Best Practices on Hadoop"},"content":{"rendered":"<p>A quick summary from my experiences with Hadoop:<\/p>\n<ul>\n<li>Don\u2019t lose focus on what really matters: not to efficiently store and retrieve fabulous amounts of data, but to extract useful insights from&nbsp;it.&nbsp;<ul>\n<li>The quickest way to start analyzing large amounts of data is by re-using R code from <span class=\"caps\">CRAN<\/span> with the help of <a href=\"http:\/\/www.cascading.org\" target=\"_blank\" rel=\"noopener\" class=\"broken_link\">Cascading<\/a>, a tool that can run <span class=\"caps\">PMML<\/span> models exported from R. Mahout is a very good alternative, but not very efficient at the moment.<\/li>\n<li>Most Hadoop deployments in the world are in experimental phases and not in production: they are proofs of concept. Many projects will fail to meet expectations because they expect too much too soon, even when basic functionality is not mature enough (real-time querying \u2013Impala, Stinger v3, Drill\u2011, processing of graph structures \u2011Giraph\u2011, granular security \u2011Accumulo\u2011).<\/li>\n<\/ul>\n<p><span id=\"g25598792\">&nbsp;<\/span><\/p><\/li>\n<li><span style=\"line-height: 1.5em;\">Hadoop is not a replacement for the traditional Data Warehouse: the strength of Hadoop is that it handles massive amounts of unstructured data that DWHs weren\u2019t designed for. 
It\u2019s true that Hadoop is less expensive than any <span class=\"caps\">DWH<\/span>, but that doesn\u2019t mean that it will replace all of their workloads: redesign the <span class=\"caps\">DW<\/span> architecture to make room for Hadoop.<\/span><\/li>\n<li><span style=\"line-height: 1.5em;\">HBase and <span class=\"caps\">HDFS<\/span> are very different: use HBase to serve content on websites and reference data, since it\u2019s designed for key\/value lookups, fast range scans and maintaining versions; use <span class=\"caps\">HDFS<\/span> for <span class=\"caps\">ETL<\/span> and heavy workloads, since it\u2019s designed for batch processing and large data&nbsp;scans.<\/span><\/li>\n<li><span style=\"line-height: 1.5em;\">Basic architectural tips:<\/span>\n<ul style=\"line-height: 1.5em;\">\n<li>Larger clusters heal faster, since lost blocks are re-replicated across more nodes.<\/li>\n<li>More racks offer more failure domains.<\/li>\n<li>Plan for high availability of the master\/name node: configure a standby NameNode with periodic metadata updates for a <span class=\"caps\">HA<\/span> architecture (note that a Secondary NameNode only checkpoints the namespace; it is not a failover node).<\/li>\n<li>The <i>raison d\u2019\u00eatre<\/i> of the cloud is elasticity: reserve capacity for the floor of measured demand and spin up extra capacity on demand. Consider automating the removal of datanodes when they are not in&nbsp;use.<\/li>\n<li>The namenode is a factor that could limit growth, since it keeps the entire namespace in <span class=\"caps\">RAM<\/span>: more than 100 <span class=\"caps\">GB<\/span> may be necessary for very large deployments (1 <span class=\"caps\">GB<\/span> of metadata is typically needed per 1 <span class=\"caps\">PB<\/span> of storage).<\/li>\n<li>Plan for enough capacity: storage utilization should never reach 80%, or the cluster will start to slow down. 
Spare nodes let the cluster keep running through failures.<\/li>\n<li>Nodes must be NTP-synchronized.<\/li>\n<li>When everything is properly set up, a single operator should be able to manage <span class=\"caps\">5K<\/span>&nbsp;nodes.<\/li>\n<\/ul>\n<\/li>\n<li><span style=\"line-height: 1.5em;\">The performance of Hadoop tasks is I\/O\u2011bound by design: beware that public cloud servers (Azure\/Amazon) are not designed for I\/O-intensive tasks, just for in-memory processing. Usually, the storage is separated from the CPUs by a network (<span class=\"caps\">NAS<\/span>): this architecture impacts performance more than disk virtualization does, so, whenever possible, try to use local storage with storage-optimized instances.<\/span>\n<ul style=\"line-height: 1.5em;\">\n<li>On the other hand, using cloud storage (<span class=\"caps\">S3<\/span>, <span class=\"caps\">AVS<\/span>) also has its advantages: you will be able to re-use the stored files across different clusters without creating a copy for each one; these cloud stores also offer much higher availability.<\/li>\n<\/ul>\n<\/li>\n<li><span style=\"line-height: 1.5em;\">Processing is often memory-bound rather than <span class=\"caps\">IO<\/span>\/CPU-bound (real-time queries are memory-hungry), so take extra care to conserve memory while architecting and coding.<\/span><\/li>\n<li><span style=\"line-height: 1.5em;\">Only consider writing Map\/Reduce code for the most common Hive\/Pig queries: Map\/Reduce is the assembly language of Hadoop; use it as a last resort.<\/span><\/li>\n<li><span style=\"line-height: 1.5em;\">Operations<\/span>\n<ul style=\"line-height: 1.5em;\">\n<li>Balance data periodically, in particular after growing a cluster.<\/li>\n<li>Cold data should be archived and hot data over-replicated.<\/li>\n<li>Set quotas for everything: they will help you stop domino failures.<\/li>\n<\/ul>\n<\/li>\n<li><span style=\"line-height: 1.5em;\">Always back up the namenode. 
Also, consider mounting several redundant directories for its metadata (e.g. on <span class=\"caps\">NFS<\/span>).<\/span><\/li>\n<li><span style=\"line-height: 1.5em;\">Monitoring and performance tuning: the only way to start optimizing your code is to collect statistics while running jobs, using the best available tools (Nagios, Operations Manager, \u2026). There\u2019s also specialized software to monitor Hadoop loads (Ambari):<\/span>\n<ul style=\"line-height: 1.5em;\">\n<li>You should monitor everything: disk I\/O and <span class=\"caps\">SMART<\/span> statistics, size and number of open files over time, network I\/O, <span class=\"caps\">CPU<\/span>, memory, <span class=\"caps\">RPC<\/span> metrics, <span class=\"caps\">JVM<\/span> statistics, etc\u2026 Analyze and correlate these with Hadoop statistics (<span class=\"caps\">HDFS<\/span>, MapReduce, Hive).<\/li>\n<li>You will discover that enabling compression, using a better algorithm for task scheduling, and increasing the number of threads, the number of parallel copies and the <span class=\"caps\">HDFS<\/span> block size per map are common changes: every Hadoop distribution seems to set them too low. Note that larger blocks per map imply larger heap sizes for the map outputs to be sorted in the map\u2019s sort buffer.<\/li>\n<li>The number of map tasks should be less than half the number of available processor cores, and the number of reduce tasks half the number of map tasks. 
Avoid having too many maps, or many maps with a very short run-time.<\/li>\n<li>The number of reduces is a decisive factor: too many reduces produce countless small files that decrease performance; on the other hand, with too few reduces, each one may have to process too large a load.<\/li>\n<li>Correct <span class=\"caps\">JVM<\/span> configuration is a must (and it\u2019s not only about the maximum amount of memory per virtual machine): only use a 64-bit <span class=\"caps\">JVM<\/span> with a low-latency garbage collector.<\/li>\n<li>Find and analyze failed datanodes: in the long term, this could help save the cluster if the problem starts to spread.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>A quick summary from my experiences with Hadoop: Don\u2019t lose focus on what really matters: not to efficiently store and retrieve fabulous amounts of data, but to extract useful insights from&nbsp;it.&nbsp; The quickest way to start analyzing large amounts of data is by re-using R code from <span class=\"caps\">CRAN<\/span> with the help of Cascading, 
a&nbsp;tool&nbsp;[\u2026]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"wp_typography_post_enhancements_disabled":false,"ngg_post_thumbnail":0},"categories":[27,20],"tags":[],"_links":{"self":[{"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/posts\/1469"}],"collection":[{"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/comments?post=1469"}],"version-history":[{"count":2,"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/posts\/1469\/revisions"}],"predecessor-version":[{"id":1699,"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/posts\/1469\/revisions\/1699"}],"wp:attachment":[{"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/media?parent=1469"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/categories?post=1469"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/cerezo.name\/blog\/wp-json\/wp\/v2\/tags?post=1469"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}