From the offset I will be honest, hadoop is a nightmare to setup, its versions are all over the place, miss-matches lead to random failures and its just not a fun thing to be doing. However the nice people at Cloudera have a much easier solution to all of this, the provide a nice management interface to install your cluster. Though this is almost seamless, there are a few gotcha’s that you need to be aware of that can catch you out. So my hints are below:
The nodes in a hadoop cluster talk to each other in a lot of different ways, the number of ports you need open depending on your configuration is mind blowing, and from the way things are with hadoop its also ever changing! The shortcut to this is to shut off your firewall using a command such as:
service iptables stop
Or the equivalent for your linux version. Now I’m well aware this isn’t best practice, but if you’re just getting something up and running to test out or are hitting a brick wall and want to make sure its not a firewall problem then its a good test. Later on I’ll cover a long term fix for this.
Hadoop expects a fully working DNS setup, however this isn’t always in line with how cloud providers set up their servers. For instance my host of choice is RackSpace, who are awesome by the way, but when you setup new nodes they all get names so you can do things like:
However if you have 3 data nodes there is no way for nodes 2 or 3 to know about data node 1, or each other. If you end up in this state your hadoop cluster gets in all sorts of a mess, some systems use DNS, some use IP’s and it’s impossible to know what’s going on.
The fix for this is easy, you need a working DNS system. This can either be achieved by setting up a fully working DNS server (various cloud providers support this or roll your own on a linux box) or if you have a small cluster you can do this manually. If you edit the /etc/hosts file it will contain a list of IP to name mappings separated by tabs, such as
Now all you need to do is add IP to name mappings for any other servers in your cluster and you’re sorted.
Longer term the suggested fixes above just aren’t feasible. Shutting off the firewall is not smart and manually setting up DNS is a long process. This advice is just to help you over that first hurdle and get things working. If you plan to invest in a production hadoop cluster I suggest going with a tool such as puppet to setup your servers so they are ready for Cloudera but also secure.