You can keep track of what we are up to on our site or our blog. It’s going to be very interesting, and we expect to learn a great deal along the way, but we can’t wait to get out there and move towards Kickstarter!
Leiningen is the de facto build tool for Clojure projects. Installing it in a corporate environment probably means installing from behind a proxy, which I found wasn’t well documented even though it takes just a few simple steps:
Set the environment variables http_proxy and https_proxy; Leiningen will honor these settings.
This weekend I decided to knock up an example of some of the great features of Clojure, such as Java interop, multi-threading, its data structures and its functional style of coding.
Clojure is a fantastic language that runs on the JVM. It’s a truly modern language that takes into account the growing impact of Moore’s Law on programming, and the fact that object orientation comes with a number of downsides that slow the progress of creating reliable, quality software quickly. If you’re new to Clojure it’s well worth checking out this video.
I created this project to demonstrate how a simple use case can be answered using the various features of Clojure; the source code is up on GitHub. The features demonstrated are:
The key to creating concise, reliable, quality code in Clojure is that it’s a Lisp. Taking the step from Java to Clojure means getting used to working in a Lisp. If this is your first step into programming Clojure, I would suggest you check out Try Clojure and the Clojure Koans before looking into this code; it will make a huge difference to your understanding.
To understand how this works in practice, look at the following line. It takes a list of tweets, filters out any tweets that have fewer than 5 retweets, sorts the list by retweet count (ascending by default) and then reverses that list.
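As a minimal sketch of that pipeline (not the project’s actual line), the steps can be chained with the thread-last macro. The sample data, the :text and :retweets keys and the popular-tweets name are all assumptions made for illustration.

```clojure
;; Hypothetical sample data: the :text and :retweets keys are assumptions.
(def tweets
  [{:text "Clojure rocks" :retweets 12}
   {:text "Lunch time"    :retweets 1}
   {:text "STM explained" :retweets 7}
   {:text "Hello world"   :retweets 5}])

(defn popular-tweets
  "Tweets with at least `min-retweets` retweets, most retweeted first."
  [min-retweets tweets]
  (->> tweets
       (filter #(>= (:retweets %) min-retweets)) ; drop tweets below the threshold
       (sort-by :retweets)                       ; ascending by default
       reverse))                                 ; flip to descending

(map :text (popular-tweets 5 tweets))
;; => ("Clojure rocks" "STM explained" "Hello world")
```

Each step reads top to bottom in the order it is applied, which is a large part of why the result stays readable.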
For anyone coming from the Java world, being able to do that in a concise, clear way is a revelation.
As Clojure is built on the JVM, it has access to any libraries and code that would be available to an application written in Java. This is a killer feature of Clojure: it already has a huge base of libraries that it can use. There are a number of special forms for accessing the underlying Java methods, which make use of the dot operator. For example, you can call a method on a Java object like this:
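A few illustrative interop calls, using only classes from the Java standard library. The shouted and name-count names, and the values used, are hypothetical and not taken from the project.

```clojure
;; Instance method: (.method object args...)
(def shouted (.toUpperCase "clojure"))
;; shouted => "CLOJURE"

;; Constructor: (ClassName. args...), followed by instance method calls
(def name-count
  (let [names (java.util.ArrayList.)]
    (.add names "first")
    (.add names "second")
    (.size names)))
;; name-count => 2

;; Static method: (ClassName/method args...)
(Math/max 3 7)
;; => 7
```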
Clojure has first-class support for working with multiple threads; its software transactional memory model ensures safe and consistent access to data. In this example the data is held in refs and read by dereferencing (using the @ symbol as shorthand):
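A minimal sketch of reading a ref, assuming a hypothetical tweet-store ref that holds a vector of tweets:

```clojure
;; Hypothetical ref holding a vector; the name and contents are assumptions.
(def tweet-store (ref [{:text "Hello" :retweets 3}]))

;; @ is reader shorthand for deref; both forms return the current value.
(def via-shorthand @tweet-store)
(def via-deref (deref tweet-store))

(count @tweet-store)
;; => 1
```

Reads never block and always see a consistent snapshot of the ref’s value.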
Writes occur in a dosync block, very much like transactional writing to a database.
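A write might look like the sketch below. The tweet-store ref and the add-tweet! function are hypothetical names chosen for illustration.

```clojure
;; Hypothetical ref holding a vector of tweets.
(def tweet-store (ref []))

(defn add-tweet!
  "Appends `tweet` to the store inside a transaction."
  [tweet]
  ;; alter may only be called inside dosync; the transaction is retried
  ;; automatically if another thread commits a conflicting change first.
  (dosync
    (alter tweet-store conj tweet)))

(add-tweet! {:text "Hello" :retweets 0})
@tweet-store
;; => [{:text "Hello", :retweets 0}]
```

Calling alter outside a dosync block throws an IllegalStateException, so an update cannot slip through without a transaction.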
Clojure provides 4 basic data structures: lists, vectors, hash-maps and sets. Their closest Java equivalents are:
List – LinkedList
Vector – ArrayList
Hash-map – HashMap
Set – HashSet
The difference between the Java collections and the Clojure ones is that the Clojure ones are immutable and persistent. These qualities mean that accessing them in the multithreaded manner above is safe, fast and reliable. Again, if these ideas are new to you, it’s worth watching the video above.
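To make immutability concrete, here is a small sketch with hypothetical values. "Adding" to a collection returns a new collection and leaves the original untouched, with the new value sharing structure with the old one rather than copying it.

```clojure
;; Hypothetical values used purely for illustration.
(def original [1 2 3])
(def extended (conj original 4)) ; returns a new vector

original
;; => [1 2 3]
extended
;; => [1 2 3 4]

(def settings {:retries 3})
(def updated (assoc settings :timeout 30)) ; returns a new map

settings
;; => {:retries 3}
```

Because nothing can change a value another thread is reading, no locks are needed around reads.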
I wrote this code to demonstrate some of the key features of Clojure and as a stepping stone from learning the basics into seeing how these tools could be used to create a useful program in very few lines of code. Clone the repo, play around and if you think it could be improved please chuck me a pull request.
If you are working with a Hadoop stack, you may want to make use of the extensions in PiggyBank, only to realize this means building the source from scratch. This isn’t easy; getting all the correct dependencies set up is a pain, so out of kindness here is the latest piggybank.jar as of 15th May 2013.
A lot of software companies suffer from the same problem: as they grow, they fall into the trap of adding unnecessary bureaucracy. Now I’m not talking about making sure a work environment is safe, or that people are looked after, or that no bullying takes place. Maintaining a healthy workplace needs a certain level of procedure and policy.
However, as organizations grow it becomes all too easy to create unnecessary policies, ones that are so tight they kill any sense of autonomy.
Let’s take an extreme example of this: let’s say one day someone gets caught torrenting files at work. Apart from firing them, what else could happen? A policy could be set, firewalls could be turned up to max, teams could be brought in to police things, and the software users are allowed could be locked down to a standard set. There are a lot of ways to stop this, but now that one person has ruined things for everyone in the future.
Let’s look at this mathematically. Let’s say you bring in a firewall team, a security team and a compliance team to keep people in check, and let’s say some talented developer needs to run a new service on port 8080 (really, that isn’t even an odd port!).
Now a change that would have taken no time suddenly involves 3 new people, each one bringing their own level of efficiency. If each person is 80% efficient, here’s what happens:
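The arithmetic can be sketched as below, assuming each person involved acts as an independent multiplier on the overall efficiency. The overall-efficiency function is a hypothetical helper written for this illustration.

```clojure
;; Hypothetical helper: every person involved multiplies the overall
;; efficiency by their own efficiency (a number between 0.0 and 1.0).
(defn overall-efficiency
  "Combined efficiency of `people` people, each working at `per-person`."
  [per-person people]
  (Math/pow per-person people))

;; Three people at 80% each: 0.8 * 0.8 * 0.8
(overall-efficiency 0.8 3) ; about 0.512

;; Seven people at 90% each (one person plus six others)
(overall-efficiency 0.9 7) ; about 0.478
```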
That’s right, by adding just 3 people to a decision, each working at 80%, overall efficiency drops to about 51% (0.8 × 0.8 × 0.8). We’ve managed to cut it roughly in half!
Now think about decisions in an organization, any modern organization. Think what it would take to change some text on a production website, or to feed back on requirements. As companies grow they bring in various departments, and every single decision point requires more people to discuss it.
Even the best people, working at 90%, can get stumped by this. Even at that pace, a decision that needs 6 other people means you run at just under 48% (0.9 multiplied together 7 times: once for you and once for each of the other six).
With people working at 50%, that could fall to 7% overall, essentially grinding progress to a halt.
There are levels of bureaucracy that are necessary, but I fail to see how so many organizations miss the most basic point: how important autonomy really is, both to individuals and to the organizations themselves.
From the outset I will be honest: Hadoop is a nightmare to set up. Its versions are all over the place, mismatches lead to random failures and it’s just not a fun thing to be doing. However, the nice people at Cloudera have a much easier solution to all of this: they provide a nice management interface to install your cluster. Though this is almost seamless, there are a few gotchas that can catch you out. So my hints are below:
The nodes in a Hadoop cluster talk to each other in a lot of different ways. The number of ports you need open, depending on your configuration, is mind-blowing, and with the way things are with Hadoop it’s also ever-changing! The shortcut is to shut off your firewall using a command such as:
service iptables stop
Or the equivalent for your Linux distribution. Now I’m well aware this isn’t best practice, but if you’re just getting something up and running to test out, or are hitting a brick wall and want to make sure it’s not a firewall problem, then it’s a good test. Later on I’ll cover a long-term fix for this.
Hadoop expects a fully working DNS setup; however, this isn’t always in line with how cloud providers set up their servers. For instance my host of choice is Rackspace, who are awesome by the way, but when you set up new nodes they all get names so you can do things like:
However, if you have 3 data nodes there is no way for nodes 2 or 3 to know about data node 1, or each other. If you end up in this state your Hadoop cluster gets into all sorts of a mess: some systems use DNS, some use IPs and it’s impossible to know what’s going on.
The fix for this is easy: you need a working DNS system. This can be achieved either by setting up a fully working DNS server (various cloud providers support this, or you can roll your own on a Linux box) or, if you have a small cluster, by doing it manually. If you edit the /etc/hosts file, it will contain a list of IP-to-name mappings separated by tabs, such as:
Now all you need to do is add IP-to-name mappings for any other servers in your cluster and you’re sorted.
Longer term, the fixes suggested above just aren’t feasible. Shutting off the firewall is not smart and manually setting up DNS is a long process. This advice is just to help you over that first hurdle and get things working. If you plan to invest in a production Hadoop cluster, I suggest going with a tool such as Puppet to set up your servers so they are ready for Cloudera but also secure.
As computer engineers we aim to make things simple. Recently I’ve started to notice something which I have been calling simple complexity, which manifests over and over again in new frameworks, approaches and designs. I’m not sure what drives this; it seems to come from the desire to produce fewer lines of code or to embrace the latest ideas without thinking, but it manifests like this:
Developing in Clojure is an absolute pleasure; however, setting it up is rarely as easy. In this guide I will take you through the steps to get OS X up and running with Clojure and Leiningen (the Clojure build system).
Over the past few years I’ve had a blog in one form or another, but I’ve never really committed to it. Yes, I wrote the odd article, and yes, I jotted some ideas down, but I never really gave it enough effort to do so consistently.
Well, that’s about to change. It’s time I started working on this more seriously and giving a lot more back.