Category Archives: programming

On The Never-Ending Quest of Functional Programming

Functional programming has been at the top of the hype cycle for decades: no side effects, methods written in a purely mathematical style, strong type systems… What’s not to like about functional programming? It’s even a loaded term! Just ask someone: “But isn’t your code functional yet?”

Although the Turing machine (imperative) model of computation was shown to be equivalent to the λ‑calculus (functional), there’s actually a wide gap between their asymptotics.

Start with the simplest of data structures, the array: in a purely functional style of programming, updating an array takes O(log N) time, whereas in an imperative language it’s just O(1). And this is not a problem specific to arrays; it’s the standard behaviour of pure functional programming languages, due to their overrated immutability of data: in “Pure versus Impure LISP”, it’s shown that O(N) programs written in an impure, imperative style may require O(N log N) time when translated to a purely functional, strictly evaluated style. That is, at worst an O(log N) slowdown per operation may occur when simulating mutable memory access: immutability carries a very high cost, indeed.
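To make the gap concrete, here is a minimal, hypothetical Python sketch (the helper names build, get and set_ are mine, not from any library) of the path-copying technique behind persistent arrays: an update copies only the O(log N) nodes on the root-to-leaf path and shares everything else with the previous version, while the plain list mutation at the end is the O(1) imperative counterpart.

    # A persistent "array" stored as a complete binary tree of nested tuples.
    # Updating one element copies only the O(log N) nodes on the path from the
    # root to the leaf; the rest of the structure is shared with the old version.

    def build(values):
        """Build a balanced tuple tree from a sequence whose length is a power of two."""
        nodes = list(values)
        while len(nodes) > 1:
            nodes = [(nodes[i], nodes[i + 1]) for i in range(0, len(nodes), 2)]
        return nodes[0]

    def get(tree, index, size):
        while size > 1:
            size //= 2
            if index < size:
                tree = tree[0]
            else:
                tree = tree[1]
                index -= size
        return tree

    def set_(tree, index, value, size):
        if size == 1:
            return value
        half = size // 2
        if index < half:
            return (set_(tree[0], index, value, half), tree[1])
        return (tree[0], set_(tree[1], index - half, value, half))

    v0 = build(range(8))           # version 0
    v1 = set_(v0, 3, 99, 8)        # version 1: only log2(8) = 3 nodes rebuilt
    print(get(v0, 3, 8), get(v1, 3, 8))   # prints "3 99": both versions stay usable

    # The imperative counterpart mutates in place in O(1), destroying the old version:
    a = list(range(8))
    a[3] = 99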

In practice, extensions to the pure functional paradigm have been devised to address the need to directly reference and change the state of variables (monads in Haskell), and most algorithms can be implemented efficiently using purely functional programming; but that requires extensive and difficult knowledge (“[amazon_link id="0521663504" target="_blank" ]Purely Functional Data Structures[/amazon_link]”) even for the most basic data structures, and the recognition that they incur a significant constant-factor slowdown and CPU stalling due to many cache misses.

And, as if that were not enough, debugging any functional program is a quest in itself: I would say it’s an order of magnitude more difficult than in an imperative language, vastly increasing software maintenance costs.

Software Development based on Financial Ratios

Why did software component reuse never take root? Why is outsourcing within the software industry such a bad practice, even though other industries outsource all of their development? Why do big software companies perform so badly and have such a poor innovation record?

From a purely financial viewpoint, focused on profitability ratios, these and other questions are far beyond all comprehension. But paradoxically, they emerge as the result of the blind application of financial theory to the software industry:

  • Reusing software would seem like the shortest path to profitability: selling properly tested and documented software components over and over again was the holy grail of the software industry decades ago; in fact, that’s how the whole manufacturing industry works. But the incorporation of other costs (friction, learning, integration, …) discourages this model of development; and what really killed this trend was the rejection of whole software stacks due to the elusive nature of requirements that were never taken into account, because componentized software is not built by iterating on customer feedback from the very beginning.
  • Outsourcing is the preferred way to develop software in every industry, except for the software industry itself: financially speaking, outsourcing improves every efficiency measure, lowers fixed costs and improves flexibility. What’s not to like? That which is difficult to measure: the cost of lost knowledge and the opportunity cost of not being able to innovate based on that knowledge; and finally, the cost of not surviving because the competition was more knowledgeable and better prepared.
  • Big companies do not innovate because all their projects get measured by IRR (Internal Rate of Return), but this ratio prioritizes short-term, low-value projects over long-term ones. Start-ups teach us that it takes 7–10 years to reach full profitability and amortize the investment: no wonder big companies fail to innovate (a small numeric sketch of this bias follows this list).
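A small numeric sketch of that bias, with made-up cashflows and helper functions of my own (npv, irr): ranking projects by IRR favours the quick, low-value project, even though discounting at a realistic cost of capital shows the long-term project creating far more value.

    # Year-by-year cashflows (year 0 is the investment): a quick flip vs. a ten-year bet.
    short_term = [-100.0, 125.0]                       # pays back after one year
    long_term = [-100.0] + [0.0] * 4 + [40.0] * 6      # pays out only in years 5-10

    def npv(rate, cashflows):
        """Net present value of yearly cashflows at the given discount rate."""
        return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cashflows))

    def irr(cashflows, lo=0.0, hi=1.0, tol=1e-6):
        """Internal rate of return by bisection (NPV is decreasing in the rate here)."""
        while hi - lo > tol:
            mid = (lo + hi) / 2.0
            if npv(mid, cashflows) > 0.0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    for name, cf in [("short-term", short_term), ("long-term", long_term)]:
        print(name, round(irr(cf), 3), round(npv(0.08, cf), 1))
    # IRR ranks the short-term project first (about 25% vs. 13%), yet at an 8% cost
    # of capital the long-term project's NPV is more than twice as large.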

And there are many more examples: striving for a better ROCE (Return on Capital Employed) on software projects is not a very good idea, since capital is not a scarce resource, qualified labour is; so economizing on an abundant asset is absolutely wrong. And the same goes for any other return-like ratio: they will strip any software company of its most precious assets.

Assorted Links (Software Innovations)

    1. Z3 open-sourced: better integration with Isabelle will surely follow after this!
    2. Spanner, a globally distributed database aware of time-inconsistencies
    3. Keccak, the new secure hash algorithm (SHA‑3)
    4. TypeScript, an ingenious way to solve the chicken-and-egg problem of language adoption via language supersets
    5. LTE Base station software: I hope this will be released really soon

Books (Programming)

[amazon_link id="1449322700" target="_blank" ]Windows PowerShell for Developers[/amazon_link]. For decades, the strongest point of Unix systems has always been their scriptability, beginning with the pipe paradigm of Unix commands introduced by the command shells (Bourne, C, KSH, …) and expanded by the capabilities of Perl/Python. But that is about to change, with the quantum leap introduced by Microsoft in the next version of PowerShell: more than 2300 cmdlets, powerful remoting enabling the distributed automation of tasks, Windows Workflows and access to almost every application via COM and .NET interfaces. All this, and more, will erode and leapfrog the traditional competitive advantages of Unix systems. But to really master PowerShell, it’s much better to start from the perspective of the professional developer and skip all the deficient scripting done by systems administrators. Thus this book is the perfect starting point, in that it not only shows the tips’n’tricks of PowerShell, it also teaches by example how to extend applications via embedded scripts.

[amazon_link id="3642087825" target="_blank" ]Formal Correctness of Security Protocols (Information Security and Cryptography)[/amazon_link]. A theoretical and practical guide to the generation of formal proofs for security protocols using the inductive method, an ambitious enterprise of mixed results which is of primordial importance in a field of ever-growing complexity and numerous definitions of what is secure. Short and straight to the point, this book offers lots of Isabelle theorem prover code for some prominent security protocols: Kerberos IV & V, Shoup-Rubin, the Abadi-Glew-Horne-Pinkas protocol for certified mail and the non-repudiation protocol of Zhou-Gollmann. The best part of this book is the last chapter, in which an honest compilation of statistics shows the effort dedicated to modelling each security protocol.

[amazon_link id="142006973X" target="_blank" ]Combinatorial Pattern Matching Algorithms in Computational Biology Using Perl and R[/amazon_link]. Pedagogical, practical and with tons of examples, it progresses from pseudo-code to Perl and R source code for the most common algorithms of this interdisciplinary field, in which the beauty of nature is left to be interpreted and apprehended with some basic computer data structures: sequences for DNA pattern matching; trees for phylogenetic and RNA reconstruction; and graphs for biochemical reactions and metabolic pathways. Although its lack of theorems is worrisome, it certainly suits its target audience of biologists with little exposure to formal computer science.

Software as a By-Product of Organizations and their Processes

The architecture of a software product and its underlying infrastructure is not totally determined by its intended functionality; rather, it tends to mirror the structure of the organization in which it is developed (the mirroring hypothesis): this effect is so strong that an order-of-magnitude difference in component modularity is observed between software made by tightly coupled and by distributed development teams, consistent with the view that distributed teams tend to develop more modular products. That is, the ultimate software architecture is just a copycat of the communication structures of the organizations and their interactions, reflecting the quality and nature of the real-world interpersonal communications between the teams in their various degrees of integration: having a common and clear mission, their physical closeness and possessing formal authority over others to control development.

So be it: software created by distributed teams with misaligned incentives, under the routine of design by committee, will only give rise to specification wars. Human nature being what it is, power plays will arise in the distribution of information, impacting product quality, as the structure of a system tends to reflect the power relationships and status of the people and organizations involved.

The process of software design rests on a shared mental process between the software developers: the search space of its architecture is constrained by the nature of the organization within which this search happens. In closed systems, it’s widely but wrongly believed that the designs are highly modular: on the contrary, dependency density and propagation costs run high, and project schedules fall apart during component integration, especially due to indirect system dependencies.

In the largest study to date of arguably the largest and most successful codebase in the world, the Windows operating system, it was found that organizational structure metrics were better predictors for classifying failure-prone binaries than other models using traditional metrics of code churn, code complexity, code coverage, code dependencies and pre-release defect measures. Hence, for all the thought ironically given to software architecture, it turns out that a well-planned organization with the proper checks and balances is the key to reducing the amount of communication and coördination necessary for the success of software projects. And then, and only then, trust and the willingness to communicate openly and effectively shall follow.

As it turns out, Conway’s Law, the old adage commonly invoked in Computer Science to sum up these ideas, is but a version of a much older story, the techno-reenactment of the Tower of Babel (Genesis 11:1–9).

Static vs. Formal Methods for Code Auditing

Sure, it’s always feasible to write secure C code: for example, using the ternary operator (cond ? x : y) it’s possible to write expressions so that errors are thrown at compile time whenever type mismatches are found, but it would also be very easy to accidentally bypass these software constructs. However, for serious undertakings, it’s always better to rely on the existing tools that automate the many mind-numbing tasks that comprise a software audit.

Static code analysis has a long history (with the first lint tool dating back to 1977), but only the profusion of programmers from the last decade, instructed in the paradigm of strongly-typed languages, generated the expectation of getting the same level of pre-execution bug detection for weakly-typed languages. So strong is the impetus for static analysis in the C arena that the whole Clang compiler suite is replacing gcc just because it is almost impossible to develop static analysis tools for the latter. Nowadays, in the hall of fame of static analysis code auditing tools, Deputy gets the top prize among the open-source/free programs, and Coverity among the commercial ones, with Veracode being a strong contender.

Coverity is a commercial spin-off of the academic work of Dawson Engler, whose papers are a must not only to understand how their tools work, but also other cutting-edge research subjects like symbolic execution. Coverity Static Analysis generates a higher number of false positives than Veracode’s, which is disappointing given how much they charge, but neither of the two surpasses the huge number generated by the venerable PC-Lint. But this last one, and others like PVS-Studio and Parasoft C/C++, are still a must if you prefer on-premise software rather than giving away your precious source code to the cloud. Or, if no external tools are available, at least consider the code analysis tool in Visual Studio.

In a different category, Deputy is an innovative software project that tries to introduce rigorous type safety based on annotations that are included within the code. I agree that it’s always a great idea to start introducing annotations while the codebase is in a nascent state, but it’s much more cumbersome when software projects are in an advanced state, because it is akin to the tedious maintenance of code written by others. It also features runtime checks to detect those faults that couldn’t be detected at compile time. Or, if time is limited, you may consider just using splint instead of annotating everything.

In the end, I like to think that the most critical parts of the code should be minimized and subjected to formal verification efforts (see L4.verified), while always noting the implicit circular paradox of having to write a specification for the formal verification process, which is just a program in itself. Because, who verifies the verifier?

Python in the Financial Markets

With the SEC recommending the use of Python to report ABS cashflows and major investment banks (J.P. Morgan, Goldman Sachs, Morgan Stanley) slowly replacing their Matlab code with Python to take advantage of its much faster time-to-prototype/time-to-market, it’s an awesome moment to enjoy watching a nascent field grow up.
I’ve collected the most representative public resources on the use of Python in finance, along with some experiments of mine, in these slides.
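As a small taste of why the prototyping is so quick, here is a self-contained sketch (all figures invented, helper names mine) of the kind of arithmetic these cashflow reports boil down to: discounting a monthly payment schedule to a present value in a few lines of plain Python.

    # Discount a made-up monthly cashflow schedule to its present value.

    def discount_factor(annual_rate, month):
        """Monthly-compounding discount factor for a payment `month` months out."""
        return 1.0 / (1.0 + annual_rate / 12.0) ** month

    def present_value(cashflows, annual_rate):
        """cashflows: iterable of (month, amount) pairs."""
        return sum(amount * discount_factor(annual_rate, month)
                   for month, amount in cashflows)

    # A toy tranche paying 1,000 a month for 24 months, valued at a 6% annual rate.
    schedule = [(month, 1000.0) for month in range(1, 25)]
    print(round(present_value(schedule, 0.06), 2))   # roughly 22,563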

LMAX: Going Slow to Go Fast

Scalable web architects must learn from the transactional world: stock exchanges and other financial creatures of mass data processing. For example, LMAX attaining 100K+ TPS at less than 1 ms latency is a remarkable technical feat. Sure, other exchanges like NYSE and LSE manage to achieve higher TPS at lower latencies; however, it’s the history behind LMAX that makes it a fascinating object of study: I’ve estimated from public sources that 20 people worked fully committed for three years to develop the initially released version of LMAX. At first, a simple proof of concept reaching 10K TPS was produced, followed by a long series of recode-measure-debug cycles that felt more like squeezing and juicing the JVM to achieve significant speed-ups than real programming, because writing cache-friendly code for an adaptive, optimizing JIT virtual machine with no control over how data structures are mapped to memory is really hard, as nonlinearities appear everywhere in the typical code optimization process.

It’s the same kind of technical debt Google experienced when it started with a Java codebase, then migrated to a slower Python one, to finally settle on C/C++; or the current technical debt at Twitter, a pure Ruby on Rails product that moved to Java and Scala with phenomenal results.

When frameworks and virtual machines get in the way, it’s the good old wisdom from people like Ken Thompson that illuminates the path to success: “One of my most productive days was throwing away 1000 lines of code.”

Android Decompilation & Obfuscation

I get more than a hundred visits a day to my iPhone Decompilation & Obfuscation post; that’s why writing an Android equivalent and comparing the results between them will be so interesting for assessing platform demand from developers.

To decompile an Android .apk file, follow these steps:

  1. Download the app from the Android Market to your smartphone and back it up with a tool like Titanium Backup to get the .apk file
  2. Next, use apktool to get back the project file structure and resources
  3. Then, use dex2jar to obtain the .class files from the .dex files
  4. After that, use jd-gui or JAD to decompile the .class files
  5. Most bytecode won’t decompile perfectly and some routines will be hard to reconstruct from the bytecode: get ready to read Dalvik assembly disassembled with smali (a scripted sketch of steps 2 and 3 follows this list)
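A hypothetical Python wrapper for steps 2 and 3, assuming apktool and the d2j-dex2jar wrapper script are installed and on the PATH; output locations and flags vary between versions of these tools, so treat this strictly as a sketch.

    # Automate the apktool / dex2jar steps; exact output names depend on tool versions.
    import subprocess

    def unpack_apk(apk_path):
        # Step 2: decode the resources and project structure with apktool.
        subprocess.run(["apktool", "d", apk_path], check=True)
        # Step 3: convert the Dalvik bytecode (.dex) into a .jar of .class files,
        # ready for jd-gui or JAD in step 4.
        subprocess.run(["d2j-dex2jar", apk_path], check=True)

    unpack_apk("app.apk")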

To obfuscate/protect your application, consider following these steps:

  • ProGuard is the most complete and useful tool to obfuscate applications, but you must use it with the following configuration file to avoid any problems. Note that ProGuard is pre-packaged in the SDK from Android 2.3 onwards
  • Use LVL for your paid applications, but remember that it has already been broken.
  • Lastly, consider using the Android NDK for the most critical code. Writing JNI code is a really cumbersome and error-prone process; that’s why using specialized tools is essential to avoid errors and speed up development: to interface C libraries with Java, try SWIG and GlueGen; in reverse, to interface Java with C, try HawtJNI. It’s a pity that the Integrated Debugger for Java/JNI Environments is only available for the Apache Harmony JVM, as it really helps in the difficult Java/JNI debugging process.

As a final note, the results from the superb paper “On the (Im)Possibility of Obfuscating Programs” will always tame our aspirations in the obfuscation enterprise.

Code battling code battling… (ad infinitum)

pMARS, the official Redcode simulator

“Battle programs of the future will perhaps be longer than today’s winners but orders of magnitude more robust. They will gather intelligence, lay false trails and strike at their opponents suddenly and with determination.”
‑A. K. DEWDNEY, Computer Recreations, Scientific American (January 1987)

Mahjong is different from Checkers, just like High Frequency Trading is different from Core War: they include the profit motive!