Introduction to Hadoop

Over the past year the world of Hadoop has exploded. The world has suddenly woken up to the fact that a huge amount of data is being collected and that it could have real value in business. So I wanted to spend a bit of time explaining what Hadoop is and how it all works, leading into some guides on working with Hadoop.

A brief history of Hadoop

Hadoop is derived from 2 Google white papers on MapReduce and GFS (the Google File System). When Google released these papers it started to explain exactly how it is able to store and process huge (think petabytes) amounts of data in a simple, efficient way.

Doug Cutting took the information in those white papers and started developing Apache Hadoop. Hadoop consists of 2 components, MapReduce and HDFS. Just as with the Google white papers, these systems provide the ability to process and store huge amounts of data respectively.

Since that early start Hadoop has spawned a number of projects such as HBase (an implementation of Google’s Bigtable), ZooKeeper (node co-ordination), Pig and Hive (new ways to write MapReduce jobs) and a lot more…

Why should I care?

These days technology is everywhere and people use it for everything. When we build new computer systems we build on top of existing systems (deploying a website on Apache, building an enterprise server on Glassfish, etc.) which already contain a lot of code. On top of that we deploy thousands of lines of our own code. On a normal day any computer system will generate a lot of data about each individual user. Multiply that by the number of users in your system and all of a sudden we have gigabytes of data being produced hourly.

Inside that data, whether it’s being stored in flat files, databases or some other means, we have a lot of valuable insight into what is going on inside our systems, anything from simple counting to complex analysis of user patterns.

Now this is nothing new; people have long been able to find patterns in data. What Hadoop brings is the ability to do this over commodity server hardware very easily. Thanks to HDFS you can scale your storage by adding new nodes, and it provides simple APIs to store data with redundancy over multiple servers so that losing a single machine does not mean losing data. MapReduce provides a very simple way to write algorithms to process this data, and it scales fantastically well as your data sets get bigger. Analysing billions of data items very quickly and very cheaply is why people are going mad over Hadoop!

How does it work?

I won’t explain the internals. If you want to understand those, I suggest you start with the Google white papers, and this book is great on the architecture of HDFS. Here I will go into how you can apply this technology to solving problems.


Have you used the Linux file system? That’s it, that is how easy HDFS is! Once you have it set up and have configured redundancy, you are good to go. You can copy files from a local file system into the HDFS-managed file system, and from there you can do what you want. HDFS manages all the blocks (it chops files up into blocks) and ensures redundancy: if you lose a node in your cluster it will move blocks around to ensure you still have enough redundancy, and it will migrate your data as you install new nodes. That’s it!
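
To get a feel for what chopping files into blocks and keeping redundant copies means in numbers, here is a rough sketch in Python. The 64 MB block size and the replication factor of 3 are assumptions (both are configurable on a real cluster), and `block_summary` is a made-up helper for illustration, not part of any Hadoop API.

```python
import math

MB = 1024 * 1024

def block_summary(file_size_bytes, block_size_bytes=64 * MB, replication=3):
    """Estimate how many blocks a file is split into and how much raw
    disk it occupies once every block has been replicated."""
    blocks = math.ceil(file_size_bytes / block_size_bytes)
    # the last block only uses as much space as it holds, so raw storage
    # is simply the file size multiplied by the replication factor
    raw_bytes = file_size_bytes * replication
    return {
        "blocks": blocks,
        "block_copies": blocks * replication,
        "raw_bytes": raw_bytes,
    }

summary = block_summary(1024 * MB)  # a 1 GB file
print(summary)
```

With those assumed settings a 1 GB file becomes 16 blocks and 48 block copies spread around the cluster, which is why losing one node does not lose the file.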


MapReduce allows you to process your data, which thanks to HDFS is already shared over your whole cluster, using all the CPU cycles in that cluster to crunch it. MapReduce is a new paradigm for processing data and it takes a little while to get used to. In subsequent articles I will go into depth on patterns you can use to process data (mostly in Pig and Hive, as writing jobs in Java is tedious), but for now I will give a general example.

Let’s say you have a file stored on HDFS that contains a huge list of words, with lots of repetition, in a random order. In fact let’s say that it’s not just huge, it’s HUGE: 100 gigabytes of data! How would you normally process this in code? Load the file, loop, increment a counter for each word, wait forever, then get an answer. I think we can do a bit better than that. So how do we do it in a MapReduce world? Let’s go through the steps.

  • Load the file
  • Loop
  • For each word emit a map
    • Each word maps to a count; as we loop over the long list we map every word we see to the number one
    • Word:1
  • Sort each word
    • In our map the word is the key, each word is emitted to a specific reducer.
    • In this way we can split up our data set but also ensure each item gets processed once
  • Reduce the words
    • Each reducer gets a number of word:1 tuples
    • We can just loop and count these
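
The steps above can be sketched in a few lines of plain Python. This is only a single-process illustration of the idea, not Hadoop code: the function names and the choice of 3 reducers are made up for the example, and on a real cluster each partition would be handled by a separate reducer task.

```python
from collections import defaultdict

def map_words(lines):
    """Map step: emit a (word, 1) pair for every word we see."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs, num_reducers):
    """Sort/shuffle step: route each pair to a reducer chosen by its key,
    so all the pairs for one word land on the same reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for word, one in pairs:
        partitions[hash(word) % num_reducers][word].append(one)
    return partitions

def reduce_counts(partition):
    """Reduce step: loop over the ones collected for each word and add them up."""
    return {word: sum(ones) for word, ones in partition.items()}

def word_count(lines, num_reducers=3):
    totals = {}
    # each partition is independent, so these reduces could run in parallel
    for partition in shuffle(map_words(lines), num_reducers):
        totals.update(reduce_counts(partition))
    return totals

counts = word_count(["the cat sat", "the dog sat", "the end"])
print(counts["the"], counts["sat"], counts["cat"])  # prints: 3 2 1
```

Because a word only ever appears in one partition, merging the reducer outputs never has to reconcile two counts for the same word.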

Although this is a very simple case for using MapReduce, the key principles are in place. By mapping our data we essentially organise it by key; as every key goes to exactly one reducer, the keys can be processed separately, and finally we reduce each of those simple data sets in a loop. In reality most MapReduce jobs contain many steps to achieve a goal, which means things get split down into smaller and smaller tasks, then often joined back up to answer questions. In this sense MapReduce jobs are intrinsically parallelizable, without you worrying about threads, locks, etc. Even for our trivial example it means the reduce step for each distinct word could run in parallel, and British English has around 170,000 words. I for one would not want to try to code that by hand!

That in essence is the power of Hadoop: massive storage and massive processing made very simple. As time goes on things of course get more complex, and jobs become complicated and make use of new languages that express things better than Java can, but once you have got your head around the word count above you are well on your way.

Next steps

Over the coming weeks and months I’ll start blogging in depth about how to set up, run and code for Hadoop, along with some of its related technologies. In the meantime I suggest you take a look over the original word count source code to understand how the above algorithm translates into real code.

O2 Send your mobile number to every website you visit!

Today an interesting article appeared in my RSS feed, stating that O2 send your mobile number to every web page you visit. I couldn’t believe this, so I decided to spend 5 minutes checking. They do, and it’s stupidly easy for anyone to get your number and spam you.

A bit of background.

When you send a request to a website for a page you send along a number of headers stating things like what device you are on, what screen size you have and what features your browser supports. This means web pages can change to show you fewer images on your mobile, and some even change the formatting to make it easier to read. All very good stuff.

However, this news reveals that O2 add an extra header that tells the website what your mobile number is! Think about how many websites you visit, and think about how many people could grab your mobile number from this! I spent all of 30 seconds writing the following script to dump the HTTP headers to the screen to check this. Here is the entire script:

You can see this script running here:

Just make sure you visit over 3G not WiFi and you will see your mobile number listed there in the headers! It’s called HTTP_X_UP_CALLING_LINE_ID.
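
A header dump of this kind takes very little code in most web stacks. As an illustration only (this is a Python WSGI sketch, not the script used for the check above), the following echoes back every request header the server received. The function name `dump_headers` is made up for the example.

```python
def dump_headers(environ, start_response):
    """WSGI app that replies with every HTTP request header it was sent."""
    # WSGI exposes request headers as HTTP_* keys, so a header called
    # x-up-calling-line-id shows up as HTTP_X_UP_CALLING_LINE_ID
    lines = [
        f"{key}: {value}"
        for key, value in sorted(environ.items())
        if key.startswith("HTTP_")
    ]
    body = "\n".join(lines).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

To try it locally, hand `dump_headers` to `wsgiref.simple_server.make_server` from the standard library and load the page from a phone. Any site you visit can read the same values just as easily.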

Don’t worry, I don’t record any headers or log this information, but other websites could!

This probably explains how spammers keep getting your mobile number. I suggest you call O2 immediately and complain. I know I will.

Full credit for the original discovery and info is here:

Loading the OSGi Web Console into Glassfish

For a while now Glassfish has been built on OSGi, which is a fantastic win for a lot of developers working with enterprise Java. Want to include modular OSGi software in your stack without having to install new servers and runtimes? No problem for those of you running Glassfish!

Here I’ll go through the process of starting the OSGi web console (which is like the normal admin screens, but for OSGi). Soon I’ll be posting a few guides on setting up a stats reporting system which will allow you to easily log events from within your code into a Graphite server.


Are you running version 3 of Glassfish? Then you’re all done!

Unfortunately those of you on older versions of Glassfish are out of luck: this is V3 only.


You can deploy bundles into Glassfish in 4 ways: via asadmin commands, OSGi remote shell commands, auto deploy file locations and the web console.

Ideally we want to use only the web console to make life easy, but to set that up we need to deploy the web console itself. I would note that for more involved work (CI servers, scripted deployment, etc.) the command line and auto deploy do come into their own, but for our simple needs right now the web console is a great start.

Glassfish will automatically deploy any bundles it can find in the location:

<install path>/glassfish/domains/<your domain>/autodeploy/bundles/

All you need to do is drop your bundles in there and Glassfish will deploy them. If you remove them they get undeployed, and you can even do this with the server running if you want. Therefore, to set up the web console you just need to get the latest full web console jar from the Felix downloads page and drop it in here.

That’s it! If you check the logs you should see that it was installed and started.


Finally, all that’s left to do is log into the web console. Go to your server at the following address:


You will need to log in; the default user and pass is admin/admin, and you should end up at the following screen.

The OSGi Web Console

Next up, I will be creating some OSGi bundles that will allow you to extend Glassfish…


dOSGi – The network is the computer

OSGi is fast becoming the de facto way to create modular Java applications. However, when running large-scale enterprise applications, the bottleneck often becomes the computer itself. Usually the culprits are RAM and CPU: eventually you hit a limit and things just won’t go any faster.

So what’s the point in modular code if each computer works in isolation? Yes, you may have very flexible software, well designed, decoupled and all those other things, but if you’re not solving issues for your customers then what does it really matter? Remember, real artists ship!

dOSGi aims to take all those cool things that OSGi provides and give you an easy way to share them out over your computers, hence the d for distributed. The idea of deploying components and wiring them up as you want, independent of which computer they run on, really does sound like the future. So it was time to boot up Eclipse and set myself a challenge: to do a basic evaluation of the performance aspects of dOSGi.

dOSGi ships with a couple of ways to make your computers talk to each other, but the one best suited to general comms without too much hassle is the web service implementation from CXF. So I decided to see if I could get distributed components talking with this system, and to see how fast it is.

The App

The application itself was a very simple OSGi app. My service has one method to log a report event, and this method just prints some information to standard out. A separate project holds the interface definition for my service, and 2 more projects are set up to place calls to the service locally and remotely over web services.

Lesson 1

dOSGi is built around declarative services within OSGi. My previous work had been with Spring DM, so it was a real joy to see how easy declarative services are. I may see how I can wire things with Spring DM at some point, but for now be prepared to create XML files for wiring things up in your OSGI-INF folder.

Wiring up a local service was easy. As with most OSGi, you just declare its interface and implementation and you’re done, like this:

What I wasn’t prepared for was how easy it would be to turn my local service into a remotely accessible web service: a few property keys and it’s done…

Lesson 2

Unless you set up your projects properly, you’re in for a lot of frustration. The errors coming out were not helpful, so here’s a list of the 2 important things you need to remember to do:

  1. Include the Equinox DS libraries and dependencies. It’s simple, but forgetting to do this means no services get hooked up, and you only find that out when you make a call and get a null pointer.
  2. If you’re building, include your OSGI-INF folder in the build. If you forget this, all your DS wiring will be missing from your jar file.

Lesson 3

Invoking services is just as easy as declaring them! You declare the services you want injected in your standard DS wiring files; then, thanks to an extra file stored at OSGI-INF/remote-service/remote-services.xml, you can re-route the calls to remote services.


Now, in all honesty, I’ve got a lot of work left on performance analysis before I can make a full comment on how this all works. The case I have picked out is probably one of the simplest, as there are no complex objects, large objects, exceptions, etc. But it does serve as a starting point for answering the question of whether dOSGi looks like it could be a workable distributed systems architecture…

My method of testing the reporting system was to run 1000 calls to the reportEvent() function both locally within the same JVM and remotely from another JVM. This serves to test the remote invocation and serialisation functionality without including things like network IO etc.

Note: My actual test function places a call, waits 10ms and then repeats. This was to simulate more realistic call patterns and avoid unfair bias towards the local calls if, for example, the network port needs to flush. It seemed pointless to see what a service was like when being DDoS’d!
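
The shape of that test loop is roughly the following. This is a Python stand-in rather than the Java used for the real test, and `time_calls` and its parameters are names invented for the sketch; `report_event` is whatever callable you want to measure, local or remote.

```python
import time

def time_calls(report_event, calls=1000, pause_s=0.010):
    """Invoke report_event `calls` times, sleeping between calls,
    and return the total elapsed wall-clock time in milliseconds."""
    start = time.perf_counter()
    for i in range(calls):
        report_event(f"event-{i}")
        time.sleep(pause_s)  # the 10ms wait between calls
    return (time.perf_counter() - start) * 1000.0

# example: time a trivial local function that just collects the events
seen = []
elapsed_ms = time_calls(seen.append, calls=5, pause_s=0.001)
print(f"{len(seen)} calls took {elapsed_ms:.1f}ms")
```

Note that the sleeps are included in the total, so they have to be subtracted before comparing a local run against a remote one.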

The results

1000 Local Calls – 10901ms

1000 Remote Calls – 13379ms

If you trim the sleeps from those calls you can see that the remote calls are around 4 times slower than local invocation.
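
The arithmetic behind that figure, as a small Python sketch. It assumes every sleep took exactly 10ms, which will not be quite true in practice, so treat the result as approximate.

```python
CALLS = 1000
SLEEP_MS = 10

local_total_ms = 10901    # measured: 1000 local calls
remote_total_ms = 13379   # measured: 1000 remote calls

sleep_total_ms = CALLS * SLEEP_MS                   # time spent asleep in each run
local_work_ms = local_total_ms - sleep_total_ms     # time spent in local calls
remote_work_ms = remote_total_ms - sleep_total_ms   # time spent in remote calls

ratio = remote_work_ms / local_work_ms
extra_per_call_ms = (remote_work_ms - local_work_ms) / CALLS

print(f"local: {local_work_ms}ms, remote: {remote_work_ms}ms")
print(f"remote/local: {ratio:.2f}x, extra per remote call: {extra_per_call_ms:.2f}ms")
```

That leaves 901ms of local work against 3379ms of remote work, a ratio of about 3.75, or roughly 2.5ms of extra cost per remote call.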


This is not a scientific test; it is a way for me to see if this technology could be a future building block of modular, distributed systems. I was expecting the time it takes to serialise to XML and back to be a lot worse, but I have to say a 4 times drop in speed is not bad.

As I stated before, when you build distributed systems you are often doing so to get around limits like CPU and RAM, so you can accept some performance loss as it may be better than the alternative of waiting on disc. For applications that require the highest performance there are far better ways to solve such problems, but then again you probably don’t need highly modular solutions.

In summary I am really surprised by the performance characteristics of dOSGi, the simplicity with which it works and just how fast you can develop very complicated systems (while maintaining tidy, modular code). Apache has a binary XML format within CXF as well, so hopefully in the future that drops in to give performance gains. In the meantime I think I need to build a more complex solution and test it out on some cloud servers.