Hadoop@Azure: a Recipe to Win the Big Data Race

Apache Hadoop, Yahoo’s open-source rework of the Google File System and MapReduce papers, has become the lingua franca of the Big Data movement. Following the trail of Google’s success, the MapReduce paradigm strikes a sensible balance between a reasonable developer learning curve, scalability and fault-tolerance for storing and querying very large datasets.
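
As a reminder of how small the conceptual surface of the paradigm is, here is a minimal word-count sketch in the Hadoop Streaming style (a generic illustration written for this post, not tied to any particular distribution):

```python
# Word count in the MapReduce style: the mapper emits (word, 1) pairs and the
# reducer sums the counts per word. In a real Hadoop Streaming job the framework
# runs these as separate processes and sorts the mapper output by key between
# the two phases; here they are chained in-process just to show the paradigm.
import sys
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # The shuffle/sort phase guarantees that pairs arrive grouped by key.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    for word, count in reducer(mapper(sys.stdin)):
        print(f"{word}\t{count}")
```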

But the true history and the community effort behind Hadoop are much more complex: Google left behind the constraints of GFS and MapReduce a long time ago, moving on to more efficient and capable technologies. And so did other open-source projects: Hive and Pig, to carry out BI/analytics queries in a low-latency fashion, matching Google’s BigQuery; HBase and Storm, for real-time search and incremental indexing, standing in for Google’s Percolator engine and the BigTable store; and Giraph, for carrying out large-scale graph-processing computations with significant speedups, like the Pregel framework at Google.

So, when talking about Hadoop, you have to distinguish between the core Hadoop stack (HDFS, MapReduce, ZooKeeper) and the growing number of projects surrounding this core. Many companies are creating Hadoop distributions (Cloudera, Hortonworks, MapR), capitalizing on the growing need for commercial support by integrating a subset of those projects in an easy-to-install package. The catch is that they will never include anything innovative or experimental, since the main reason behind their existence is support, not cutting-edge research and development: groundbreaking developments will always happen outside them, because support costs grow faster with every new line of code than the initial cost of developing it.

Here the parallel with the Linux ecosystem and its distributions is clear, since fragmentation between distributions will severely break interoperability in the very same way: so far, no one has proposed for the Hadoop ecosystem an equivalent of the Linux Foundation, the entity behind the Linux Standard Base and the Filesystem Hierarchy Standard. But then again, the incentive for the Hadoop distributions is to create lock-in, which should increase fragmentation, not reduce it.

Hadoop on Azure is just another distribution, and the comparative advantage of using this or any other distribution is null if they don’t include the full list of revolutionary projects that extend the core Hadoop stack to enable the most innovative ventures. It’s not about Azure’s pricing of the underlying storage or computation. The key is to develop and openly provide all the wrapper and glue code that is currently written, over and over, by every individual user of these projects: that will be the real challenge for Microsoft Open Technologies.

The Books of Debts

No matter how hard times get: in hindsight, everything is forgotten and hope replaces every glimmer of prudent rationality. But reading some carefully selected books is the perfect antidote to recover some good’n’old common sense:

  • [amazon_link id=“1933633867” target=“_blank” ]Debt: The First 5000 Years[/amazon_link]. A history of debt through different cultures and civilizations. Although dichotomous and highly controversial in its moral judgments, it outstandingly debunks myths like the primacy of money over debt or the dual nature of debt as an instrument of commerce and finance, and it perfectly portrays the cult of personal honor as the root of the economy through the ages. You had better skip most of the narrative and go directly to the more academic sources cited in the references.
  • [amazon_link id=“0230365353” target=“_blank” ]Manias, Panics and Crashes: A History of Financial Crises[/amazon_link]. A reference work, revered for its insights and the lasting impact of its anecdotes. Entirely literary and qualitative, it was the first to illustrate, through carefully picked descriptions of past debacles, that crises do follow recurring patterns, though it lacks a general theory of their formation and development.
  • [amazon_link id=“0691152640” target=“_blank” ]This Time Is Different[/amazon_link]. A wonderful masterpiece of the cliometric school, born of the power of the personal computer to carry out hundreds of regressions: contrary to the previous book, it offers a quantitative study of financial crises across centuries and continents, a view far away from the traditional equilibrium models of the economy. Frequentist and predictive by nature, it fails by ignoring that crises may have roots other than the failure in the savings-to-investment mechanism it forcibly ascribes them to, even though its first 200 pages are devoted to a fully detailed taxonomy of financial crises.

Assorted Links (Comp. Security)

    1. German Federal Government intelligence agencies can decrypt PGP (German)
    2. Breakthrough silicon scanning discovers backdoor in military chip and Rutkowska’s essay on Trusting Hardware
    3. A closer look into the RSA SecurID software token
    4. Off-Path TCP Sequence Number Inference Attack
    5. Fixing SSL: the Trustworthy Internet Movement
    6. Alan Turing’s Wartime Research Papers: Statistics of Repetitions and On the Applications of Probability to Cryptography

Software as a By-Product of Organizations and their Processes

The architecture of a software product and its underlying infrastructure is not totally determined by its intended functionality; it tends to mirror the structure of the organization in which it is developed (the mirroring hypothesis). This effect is so strong that an order-of-magnitude difference in component modularity is observed between software built by tightly coupled teams and by distributed ones, consistent with the view that distributed teams tend to develop more modular products. That is, the final software architecture is just a copy of the communication structures of the organizations and of their interactions, reflecting the quality and nature of the real-world interpersonal communication between the teams in its various degrees of integration: whether they share a common and clear mission, how physically close they are, and whether anyone holds formal authority over the others to control development.

So be it: software created by distributed teams with misaligned incentives, under the routine of design by committee, will only give rise to specification wars. Human nature being what it is, power plays over the distribution of information will follow and impact product quality, as the structure of a system tends to reflect the power relationships and status of the people and organizations involved.

The process of software design rests on a mental model shared among the software developers: the search space of its architecture is constrained by the nature of the organization within which the search happens. In closed systems, it’s widely but wrongly believed that designs are highly modular: on the contrary, dependency density and propagation cost run high, and project schedules fall apart during component integration, especially because of indirect system dependencies.
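
Propagation cost, the metric usually quoted in these modularity studies, is simple to compute from a design structure matrix; here is a minimal sketch (my own illustration with a made-up dependency matrix, not data from any of the studies):

```python
# Propagation cost: the fraction of component pairs (i, j) such that a change
# in j can reach i through some chain of direct or indirect dependencies,
# i.e. the density of the transitive closure of the dependency graph.
import numpy as np

def propagation_cost(dsm: np.ndarray) -> float:
    n = dsm.shape[0]
    adj = (dsm > 0).astype(int)
    visibility = np.eye(n, dtype=int)            # A^0: every component sees itself
    power = np.eye(n, dtype=int)
    for _ in range(n):                           # accumulate A^1 ... A^n
        power = ((power @ adj) > 0).astype(int)
        visibility = ((visibility + power) > 0).astype(int)
    return visibility.sum() / float(n * n)

# Toy 4-component system: 0 depends on 1, 1 depends on 2, 3 is isolated.
dsm = np.array([[0, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 0],
                [0, 0, 0, 0]])
print(propagation_cost(dsm))   # 0.4375: 7 of 16 pairs are directly or indirectly coupled
```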

In the largest study to date of the arguably largest and most successful codebase in the world, the Windows operating system, it was found that organizational-structure metrics were better predictors for classifying failure-prone binaries than other models using traditional metrics of code churn, code complexity, code coverage, code dependencies and pre-release defect measures. Hence, for all the thought that is ironically given to software architecture, it turns out that a well-planned organization with the proper checks and balances is the key to reducing the amount of communication and coördination necessary for the success of software projects. And then, and only then, trust and the willingness to communicate openly and effectively shall follow.

As it turns out, Conway’s Law, the old adage commonly invoked in Computer Science to sum up these ideas, is but a version of a much older story, the techno-reenactment of the Tower of Babel (Genesis 11:1–9).

The Need for Speed^WCapacity

Computer networks are used every day, but with very limited understanding of the consequences of their cumulative aggregation. Network coding is the field that devises techniques for their optimal utilization, to reach the maximum possible transmission rate in a network under the assumption that nodes are somewhat intelligent and able to recombine the flows they carry rather than just forward them. It’s still a nascent field, so the practical impact of its results is quite limited: for example, it would be very useful to have techniques and a tool to estimate the real network capacity of a multicast/P2P network, but that is still an open problem. Fortunately, the following paper offers the first worthy approach to closing this question:

Download (PDF, 839KB)

To be resistant to common Internet attacks, though, network coding should be accompanied by homomorphic signatures.
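
As a reminder of what the capacity in question is: the basic theorem of network coding states that the multicast capacity equals the smallest min-cut (max-flow) from the source to any receiver, and that coding actually achieves it. A toy sketch on the classic butterfly network follows (my own illustration, unrelated to the estimation technique of the paper above):

```python
# Multicast capacity = min over receivers of the max-flow from the source,
# achievable with network coding (an XOR at the bottleneck of the butterfly
# network below), but not with plain store-and-forward routing.
from collections import defaultdict, deque

def max_flow(edges, source, sink):
    """Edmonds-Karp max-flow on a dict mapping (u, v) -> capacity."""
    residual = defaultdict(int)
    neighbors = defaultdict(set)
    for (u, v), c in edges.items():
        residual[(u, v)] += c
        neighbors[u].add(v)
        neighbors[v].add(u)                          # allow residual (backward) edges
    flow = 0
    while True:
        parent, queue = {source: None}, deque([source])
        while queue and sink not in parent:          # BFS for an augmenting path
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        path, v = [], sink                           # reconstruct the path
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[e] for e in path)
        for u, v in path:                            # augment along the path
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
        flow += bottleneck

# The butterfly network: unit capacity on every edge, two receivers t1 and t2.
butterfly = {e: 1 for e in [('s', 'a'), ('s', 'b'), ('a', 't1'), ('b', 't2'),
                            ('a', 'c'), ('b', 'c'), ('c', 'd'),
                            ('d', 't1'), ('d', 't2')]}
print(min(max_flow(butterfly, 's', t) for t in ('t1', 't2')))   # 2
```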

Assorted Links (Economics)

    1. Cartels are also an emergent phenomenon
    2. Excellent dashboards: What kind of revenue does it take to go public? and Do Tech IPOs Always Fall?
    3. Lack of profitability limited the early distribution of the barcode scanner
    4. Customer Lifetime Value techniques: ARPU-based, cohort-based, and Bayesian-based methodologies (a minimal ARPU-based sketch follows this list)
    5. Institutions and Technology: Law, as Much as Tech, Made Silicon Valley
    6. When Theory Matches Reality: AMD goes fabless as they don’t spur Intel to innovate more
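
For reference, the ARPU-based approach named above reduces to a one-line formula; a minimal sketch (my own illustration, not taken from the linked articles), assuming a constant monthly churn rate and gross margin:

```python
# With constant monthly churn, the expected margin stream is a geometric series:
# CLV = ARPU * margin * sum_t (retention / (1 + discount))^t
#     = ARPU * margin / (1 - retention / (1 + discount)),
# which collapses to ARPU * margin / churn when the discount rate is zero.
def clv_arpu(arpu, gross_margin, monthly_churn, monthly_discount=0.0):
    retention = 1.0 - monthly_churn
    return arpu * gross_margin / (1.0 - retention / (1.0 + monthly_discount))

# Example: $30 ARPU, 70% gross margin, 3% monthly churn, 1% monthly discount rate
print(round(clv_arpu(30, 0.70, 0.03, 0.01), 2))   # ~530.25
```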

Towards Optimal Software Adoption and Distribution

Since the very beginning of the software industry, it’s always been the same: finding the most innovative ways to lower the friction costs of software adoption is the key to success, especially in winner-takes-all markets and platform plays.

From the no-cost software bundled with the old mainframes to the freeware of the ’80s and the free-entry web applications of the ’90s, the pattern is clear: good’n’old pamphlet-like distribution to spread software as if it were the most contagious of ideas.

It comes down to the realization that the cost of learning to use some software is much higher than the cost of its licenses; or that it is complementary to more valuable work skills; or that the expected future value of owning the network created by its users will be higher than that of selling the software itself. Nevertheless, until recently little care was given to reasoning from first principles about the tactics and strategies of software distribution for optimal adoption, so the only available information was practitioners’ anecdotes with no verifiable statistics, let alone a corpus of testable predictions. So it’s refreshing to find and read about these matters from a formalized perspective:

 

Download (PDF, 1.23MB)

The most remarkable result of the paper is that, in the very realistic scenario of random spreading of software with limited control and visibility over who gets the demo version, an optimal strategy is offered, with conditions under which the optimal price is not affected by the randomness of the seeding: just being able to identify and distribute to the low-end half of the market is enough for optimal price formation, since the price will depend on the number of distributed copies and not on the seeding outcome. But with multiple prices and full control of the distribution process (think registration-required freemium web applications), the optimal strategy is to charge non-zero prices to the higher-end half of the market, in sharp contrast with the single-digit percentage of paying customers in real-world applications, which suggests that too much money is being left on the table.
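
To get a feel for why the price tracks the number of seeded copies, here is a toy simulation of freemium seeding with network effects; this is my own simplified construction for illustration, not the model of the paper: consumer types are uniform, the product is worth type × adoption share, demo copies are seeded at random within the low-type half, and the seller grid-searches a single price for everyone else.

```python
# Toy freemium-seeding model (illustrative only): n consumers with types theta in
# [0, 1] value the product at theta * x, where x is the final adoption share.
# k free copies are seeded at random among the low-theta half; the remaining
# consumers buy if their value at the resulting adoption level covers the price.
import numpy as np

rng = np.random.default_rng(0)

def revenue(thetas, seeded, price, iters=100):
    x = seeded.mean()                          # adoption starts from the seeded base
    for _ in range(iters):                     # fixed point of the adoption dynamics
        buys = ~seeded & (thetas * x >= price)
        x = (seeded | buys).mean()
    return price * buys.sum()

n = 10_000
thetas = np.sort(rng.random(n))
prices = np.linspace(0.01, 1.0, 100)

for k in (1_000, 2_500, 5_000):                # vary only the *number* of seeded copies
    seeded = np.zeros(n, dtype=bool)
    seeded[rng.choice(n // 2, size=k, replace=False)] = True   # random low-half seeding
    best = max(prices, key=lambda p: revenue(thetas, seeded, p))
    print(f"{k} seeded copies -> revenue-maximizing price ~ {best:.2f}")
```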

Assorted Links (Theory)

    1. John Nash’s Letter to the NSA (in the same premonitory spirit as Gödel’s letter)
    2. Computing 10,000x more efficiently
    3. Superexponential long-term trends in technological progress
    4. Superb discussions on the practical feasibility of quantum computing while IBM delves into the future: Perpetual motion of the 21st century?, Flying machine of the 21st century?, Nature does not conspire and The Quantum Super-PAC
    5. Consensus Routing: the Internet as a Distributed System
    6. 30 years since the BBBW protocol, the first quantum cryptographic protocol, and 4093 patents later.

Book Recommendations

[amazon_link id=“1420075187” target=“_blank” ]Cryptanalysis of RSA and Its Variants[/amazon_link]. It’s always fascinating how even a simple set of equations can give rise to so many cryptanalytic attacks, just by looking into some corner cases: small public and private exponents, combined with the leakage of private parameters, or instantiations sharing common moduli or private exponents. To prevent these attacks, variants were also invented: using the Chinese Remainder Theorem during the decryption phase; using moduli of special forms or with multiple primes; choosing primes p and q of special forms; or the dual instantiation of RSA. If I hadn’t already read the hundreds of papers covering these topics, I would have loved to start with this book.
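
The CRT decryption variant mentioned above is simple enough to show in a few lines; a minimal sketch with textbook, unpadded RSA and toy parameters (illustration only, nowhere near production-grade):

```python
# RSA-CRT decryption: two half-size modular exponentiations plus Garner's
# recombination instead of one full-size exponentiation mod n.
p, q = 104729, 1299709                  # toy primes, far too small for real use
n = p * q
e = 65537
d = pow(e, -1, (p - 1) * (q - 1))       # private exponent (Python 3.8+ modular inverse)

dp, dq = d % (p - 1), d % (q - 1)       # precomputed CRT exponents (dP, dQ)
qinv = pow(q, -1, p)                    # qInv = q^{-1} mod p

def decrypt_crt(c):
    m1, m2 = pow(c, dp, p), pow(c, dq, q)
    h = (qinv * (m1 - m2)) % p          # Garner's formula
    return m2 + h * q

msg = 123456789
assert decrypt_crt(pow(msg, e, n)) == msg   # matches plain pow(c, d, n)
```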

[amazon_link id=“1593273886” target=“_blank” ]The Tangled Web[/amazon_link]. The web is the biggest kludge ever: a chaotic patchwork of technologies with security added as an afterthought. Understanding the details and motivation behind each security feature is no small feat, an effort that can only be carried out by someone like the author, battle-tested over the years in exploiting them. Reviewing the entire browser security model through its history is the only way to get a full understanding of how things came to be the way they are, and this is the definitive guide to understanding how complexity quickly builds up on the security front when it hasn’t been planned for from the beginning.