A quick summary from my experiences with Hadoop:
- Don’t lose focus on what really matters: not efficiently storing and retrieving fabulous amounts of data, but extracting useful insights from it.
- The quickest way to start analyzing big amounts of data is to re-use R code from CRAN with the help of Cascading, which can run PMML models exported from R. Mahout is a very good alternative, but not very efficient at the moment.
- Most Hadoop deployments in the world are in experimental phases and not in production: they are proofs of concept. Many projects will fail to meet expectations because they expect too much too soon, even when basic functionality is not mature enough: real-time querying (Impala, Stinger v3, Drill), processing of graph structures (Giraph) and granular security (Accumulo).
- Hadoop is not a replacement for the traditional data warehouse (DWH): the strength of Hadoop is that it handles massive amounts of unstructured data that DWHs weren’t designed for. It’s true that Hadoop is less expensive than any DWH, but that doesn’t mean it will replace all their workloads: instead, redesign the DWH architecture to make a place for Hadoop.
- HBase and HDFS are very different: use HBase to serve content on websites and reference data, since it’s designed for key/value lookups, fast range scans and maintaining versions; use HDFS for ETL and heavy workloads, since it’s designed for batch processing and large data scans.
- Basic architectural tips:
- Larger clusters heal faster after a failure.
- More racks offer more failure domains.
- Plan for high availability of the master/name node: configure a standby NameNode in an HA architecture. Note that the secondary NameNode only performs periodic checkpoints of the metadata and is not a failover node by itself.
- The raison d’être of the cloud is elasticity: reserve the floor of measured demand and spin up capacity on demand. Consider automating the removal of datanodes that are not in use.
- The NameNode is a factor that could limit growth, since it keeps the entire namespace in RAM: more than 100 GB may be necessary for very large deployments (about 1 GB of metadata is typically used per 1 PB of storage).
- Plan for enough capacity: storage usage should never reach 80%, or the cluster will start to get slower. Spare nodes enable the cluster to keep running through failures.
- Nodes must be NTP-synchronized.
- When everything is properly set up, a single operator should be able to manage 5K nodes.
- The performance of Hadoop tasks is I/O-bound by design: beware that public cloud servers (Azure/Amazon) are not designed for I/O-intensive tasks, just for in-memory processing. Usually, the storage is separated from the CPUs by a network (NAS): this architecture hurts performance more than disk virtualization does, so, whenever possible, use local storage with storage-optimized instances.
- On the other hand, using cloud storage (S3, AVS) also has its advantages: you will be able to re-use the stored files for different clusters without needing to create a copy for each cluster; also, the availability of these cloud storage services is much higher.
- Many times, processing is memory-bound and not IO/CPU-bound (real-time queries are memory hungry) so take extra care to conserve precious memory while architecting and coding.
- Only consider writing Map/Reduce code for the most common Hive/Pig queries: Map/Reduce is the assembly language of Hadoop, so use it as a last resort.
- Balance data periodically, in particular after growing a cluster.
- Cold data should be archived and hot data over-replicated.
- Set quotas for everything: they will help you to stop domino failures.
- Always back up the NameNode. Also, consider mounting several redundant directories for its metadata (NFS).
- Monitoring and performance tuning: the only way to start optimizing your code is to collect statistics while running jobs using the best available tools (Nagios, Operations Manager, …). There’s also specialized software to monitor Hadoop loads (Ambari):
- You should monitor everything: disk I/O and SMART statistics, size and number of open files over time, network I/O, CPU, memory, RPC metrics, JVM statistics, etc… Analyze and correlate these with Hadoop statistics (HDFS, MapReduce, Hive).
- You will discover that enabling compression, using a better algorithm for task scheduling, and increasing the number of threads, the number of parallel copies and the HDFS block size per map are common changes: every Hadoop distribution seems to keep them too low. Note that larger blocks per map imply larger heap sizes, since the map outputs must be sorted in the map’s sort buffer.
- The number of map tasks should be less than half the number of available processor cores, and the number of reduce tasks half the number of map tasks. Avoid having too many maps or many maps with a very short run-time.
- The number of reduces is a decisive factor: too many reduces produce countless small files that decrease performance; on the other hand, if there are too few reduces, each one may have to process too large a load.
- Correct JVM configuration is a must (and it’s not only about the maximum amount of memory per virtual machine): only use a 64-bit JVM with a low-latency garbage collector.
- Find and analyze failed datanodes: in the long term, it could help save a cluster in case the problem starts spreading.
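The sizing rules of thumb above (about 1 GB of NameNode metadata per 1 PB, storage kept under 80%, maps below half the cores and reduces at half the maps) can be turned into a quick back-of-the-envelope calculator. This is only a sketch: the function names are made up for illustration, and the replication factor of 3 is an assumption that does not come from the list above.

```python
def namenode_metadata_gb(storage_pb, gb_per_pb=1.0):
    """Rough NameNode RAM needed for metadata (rule of thumb: ~1 GB per 1 PB)."""
    return storage_pb * gb_per_pb


def raw_storage_needed_tb(data_tb, replication=3, max_fill=0.80):
    """Raw disk required so that replicated data stays below the fill threshold.

    replication=3 is an assumed value; adjust it to your cluster's setting.
    """
    return data_tb * replication / max_fill


def task_counts(cores):
    """Maps strictly below half the cores, reduces at half the maps (at least 1 each)."""
    maps = max(1, (cores - 1) // 2)
    reduces = max(1, maps // 2)
    return maps, reduces


if __name__ == "__main__":
    print("NameNode metadata for 100 PB (GB):", namenode_metadata_gb(100))
    print("Raw disk for 100 TB of data (TB):", raw_storage_needed_tb(100))
    print("Map/reduce tasks for 16 cores:", task_counts(16))
```

Treat the results as starting points to be checked against the monitoring statistics, not as targets.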
It’s all over the news: a vulnerability has been found in OpenSSL that leaks memory contents on servers and clients. Named Heartbleed, it has a very simple patch and some informative posts have already been written about it (Troy Hunt, Matthew Green).
What nobody is saying is that the real root cause is the lack of modern memory management in the C language: OpenSSL added a wrapper around malloc() to manage memory in a more secure and efficient way, effectively bypassing some of the improvements made in this area over the last decade; specifically, it tries to improve the reuse of allocated memory by avoiding calls to free(). Now enter Heartbleed: through a very simple bug (intentional or not), the attacker is able to retrieve chosen memory areas. What was the real use of that layer?
Face it: it’s a no-win situation. No matter how many ways these layers are going to be written, there will always be a chance for error. You can’t have secure code in C.
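To make the bug concrete, here is a toy model of a heartbeat handler that trusts the payload length claimed by the peer. It is not OpenSSL’s code: the "process memory" is just a byte string, and every name and value in it is invented for illustration.

```python
def heartbeat_vulnerable(memory, payload_offset, claimed_len):
    """Echo back claimed_len bytes, trusting the length sent by the peer."""
    return memory[payload_offset:payload_offset + claimed_len]


def heartbeat_fixed(memory, payload_offset, actual_len, claimed_len):
    """Drop the request when the claimed length exceeds what was really sent."""
    if claimed_len > actual_len:
        return b""  # malformed request: answer nothing
    return memory[payload_offset:payload_offset + claimed_len]


# Toy memory layout: the 4-byte payload sits right before unrelated secret data.
payload = b"bird"
secret = b"|user=admin;password=hunter2|"
memory = payload + secret

# The attacker sends 4 bytes but claims 20: the reply runs into the secret.
leaked = heartbeat_vulnerable(memory, 0, 20)
safe = heartbeat_fixed(memory, 0, len(payload), 20)

if __name__ == "__main__":
    print("vulnerable reply:", leaked)
    print("fixed reply:", safe)
```

The whole fix is the single length comparison in `heartbeat_fixed`.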
But rewriting and/or throwing away thousands of security-related programs written in C is not an option: the only way to run these programs securely is with the help of memory-debugging techniques, like those used by Insure++ or Rational Purify. For example, the following technical report contains a detailed analysis of some of the techniques that prevent these kinds of vulnerabilities:
Big Data is shaking up everything, from education and economics to business and the sciences: the changes may be as big as the ones introduced by the printing press. As promoted, its biggest impact is that we no longer need to research how to automate and teach a computer to do things: just inferring probabilities from big amounts of data is enough.
In the past, methods for collecting, storing and analyzing data were expensive and time-consuming: in the year 2000, digital information was just one-quarter of the world’s stored information. Now we can easily capture and store ever-growing amounts of data: today, only 1% of all the stored information is non-digital, since digital data is growing exponentially.
But behind the Big Data hype, there’s also Big Unawareness of statistical sciences:
- Big data may tempt us to cheat and work backward (data->analysis->conclusions from correlations), but correlation does not imply causation and the traditional scientific method is not to be forgotten. The same statistical error may be made on a grander scale.
- Statistical models and scientific understanding are still needed, since more data brings more spurious patterns that obscure a constant number of genuine insights: the signal-to-noise ratio quickly drops to zero without careful analysis. The mindset of the researcher is as important as always: the only answers to be found are the ones that the researcher is looking for.
- More data doesn’t always mean more accuracy: the bigger the data set, the more likely it is to have errors and the higher the number of false positives inferred. More data may not cancel out errors and carefully sampled subsets may still outperform.
- Not everything can be captured, the question of what is missing is still there, and sampling bias and error must still be considered: sampling bias is more impactful than sampling error, since there is always the question of what underlying population has been captured by the data.
In other words, Big Data does not equal Big Insights: science, deep reasoning and proper inference are as necessary as ever. Statisticians are beginning to modify and fine-tune their toolsets and, as a further remedy, I predict that tools from the Automated Reasoning field will also be increasingly adopted to fight this data avalanche.
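The point about spurious patterns can be checked with a small simulation: generate a random target and an increasing number of independent random variables, then look at the strongest correlation found by pure chance. This is a minimal sketch using only the standard library; the function names, the sample size of 50 and the fixed seed are arbitrary choices for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(n_vars, n_obs=50, seed=0):
    """Largest |correlation| between a random target and n_vars unrelated variables."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_vars):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(candidate, target)))
    return best


if __name__ == "__main__":
    # Every variable is pure noise, so any correlation found is spurious.
    for n_vars in (10, 100, 1000):
        print(n_vars, round(strongest_spurious_correlation(n_vars), 2))
```

With a fixed seed, each larger run includes the variables of the smaller ones, so the strongest chance correlation can only grow as more variables are examined.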
A graphical summary of Capers Jones’ latest book, “The Technical and Social History of Software Engineering”, aggregating the data of thousands of projects:
- Note how application size is decreasing in terms of number of lines of code, in direct correlation to the linear increase in the expressive power of programming languages. This observation fits well with the growing number of web/mobile applications that perform only a very limited number of functions.
- The maximum percentage of code reuse is growing very fast, due to a higher number of libraries and open-source projects, but spotting a project with 85% reuse is still a rarity.
- Defect removal efficiency has steadily improved, but I expected a steeper line due to static analysis and better compiler warnings.
- The percentage of personnel dedicated to maintenance has surpassed that of initial development, but there’s little research on the success factors of this stage.
As languages improved and grew in number (so that more languages are available for specific tasks), so did programmer productivity, while the defect potential dropped at the same time. This document about software engineering laws also provides another interesting view of the same datasets.
Imagine devising a set of rules for a game such that the dominant strategy of every player is to truthfully reveal their valuations and/or strategies: this is just one of the ambitious goals of mechanism design, the science of rule-making and the most useful branch of game theory. Fifteen years ago, a pioneering paper by Nisan and Ronen (Algorithmic Mechanism Design) merged it with computer science by adding the requirement that computations should also be reasonably tractable for every involved player: this created a fruitful field of research that draws on every tool of algorithmics and computational complexity, from combinatorial optimization and linear programming to approximation algorithms and complexity classes.
In practice, Algorithmic Mechanism Design is also behind the successes of the modern Internet economy: every ad auction uses its results, like Google’s DoubleClick auctions or Yahoo’s Auctions, and peer-to-peer networks and network protocols are being designed under its guiding principles. It has also contributed to spectrum auctions and matching markets (kidneys, school choice systems and medical positions), and it has generated interesting models, like the first one that justifies the optimality of the fee-setting structure of real estate agents, stock brokers and auction houses (see Fee Setting Intermediaries).
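A classic example of a mechanism where truthful bidding is a dominant strategy is the sealed-bid second-price (Vickrey) auction: the highest bidder wins but pays the second-highest bid. The sketch below is an illustration chosen for this post, not code from any of the cited works, and the helper names and the bidder called "me" are invented. It checks by brute force that no alternative bid beats bidding one’s true valuation.

```python
def second_price_auction(bids):
    """Return (winner, price): the highest bidder wins and pays the second-highest bid."""
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else 0
    return winner, price


def utility(bidder, valuation, bids):
    """Valuation minus price paid if the bidder wins, zero otherwise."""
    winner, price = second_price_auction(bids)
    return valuation - price if winner == bidder else 0


def truthful_is_best(valuation, other_bids, candidate_bids):
    """Check that no candidate bid gives 'me' more utility than bidding the valuation."""
    truthful = utility("me", valuation, {**other_bids, "me": valuation})
    return all(
        utility("me", valuation, {**other_bids, "me": bid}) <= truthful
        for bid in candidate_bids
    )


if __name__ == "__main__":
    others = {"a": 5, "b": 6}
    print(second_price_auction({**others, "me": 8}))
    print(truthful_is_best(8, others, range(0, 20)))
    print(truthful_is_best(4, others, range(0, 20)))
```

Overbidding can only win the item at a price above its value, and underbidding can only lose an item that was worth buying, which is why the check passes for any valuation.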
Up until a decade ago, the only way to learn this fascinating field of research was by venturing to read papers dispersed between the areas of economics, game theory and computer science, but this changed in 2008 with the publication of the basic textbook of the field, Algorithmic Game Theory, also available online:
Now a bit dated, it has recently been complemented with some great resources:
- The Handbook of Market Design, of which the part I have liked the most is the one on experiments.
- The online courses of Tim Roughgarden, a real master at choosing and presenting the best proofs of this field: Algorithmic Game Theory and Frontiers in Mechanism Design.
- The on-going writing of a more specific book, Mechanism Design and Approximation.
And that’s enough to begin with: hundreds of hours of learning insightful research with fantastic applications!
Computer science is changing: the amount of data available for processing is growing exponentially, and so must the emphasis on handling it. Like the 19th-century shift in physics from mechanics to statistical mechanics, the new algorithms sacrifice the precision of a unique answer for a fast search for statistical properties. The following draft of a book by Hopcroft and Kannan blazes the trail for what most future algorithms textbooks may look like:
Heavy on proofs, the book selects many topics for their mathematical elegance, not their pragmatism. In the final version of this much-anticipated book, I would love to see more content on hash algorithms, parallel algorithms and graph spanners, or a more extensive discussion of Support Vector Machines.
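As a small illustration of trading an exact answer for a fast statistical one, here is reservoir sampling: it keeps a uniform random sample of fixed size from a stream in a single pass, from which quantities like the mean can be estimated. This is an example of my own choosing, not code from the book, and the function names are invented.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of up to k items from a stream, in one pass."""
    rng = rng or random.Random()
    sample = []
    for index, item in enumerate(stream):
        if index < k:
            sample.append(item)  # fill the reservoir first
        else:
            # Replace a random slot with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = item
    return sample


def approximate_mean(stream, k, seed=0):
    """Estimate the mean of a non-empty stream from a k-item reservoir sample."""
    sample = reservoir_sample(stream, k, random.Random(seed))
    return sum(sample) / len(sample)


if __name__ == "__main__":
    # The exact mean of 0..999999 is 499999.5; the estimate reads each item once
    # and stores only 1000 of them.
    print(approximate_mean(range(1_000_000), 1000))
```

The estimate varies with the seed and the sample size, which is exactly the precision being traded for memory and speed.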