Over the past year the world of Hadoop has exploded: businesses have suddenly woken up to the fact that a huge amount of data is being collected and that it could have real value. So I wanted to spend a bit of time explaining what Hadoop is and how it all works, leading into a guide on working with Hadoop.
A brief history of Hadoop
Hadoop is derived from two Google white papers, on MapReduce and GFS (the Google File System). In releasing these papers, Google explained exactly how it was able to store and process huge (think petabytes) amounts of data in a simple, efficient way.
Doug Cutting took the information in those white papers and started developing Apache Hadoop. Hadoop consists of two components, MapReduce and HDFS. Just as with the Google white papers, these systems provide the ability to process and store huge amounts of data, respectively.
Since that early start Hadoop has spawned a number of projects such as HBase (an implementation of Google's BigTable), ZooKeeper (node co-ordination), Pig and Hive (new ways to write MapReduce jobs) and a lot more…
Why should I care?
These days technology is everywhere and people use it for everything. When we build new computer systems we build on top of existing systems (deploying a website on Apache, building an enterprise server on GlassFish, etc.) which already contain a lot of code, and on top of that we deploy thousands of lines of our own. On a normal day any computer system will generate a lot of data on each individual user; multiply that by the number of users in your system and all of a sudden you have gigabytes of data being produced hourly.
Inside that data, whether it's stored in flat files, databases or some other means, there is a lot of valuable insight into what is going on inside our systems, from simple counting to complex analysis of user behaviour patterns.
Now this is nothing new; people have long been able to find patterns in data. What Hadoop brings is the ability to do this over commodity server hardware very easily. Thanks to HDFS you can scale your storage just by adding new nodes; it provides simple APIs to store data with redundancy over multiple servers to ensure nothing is ever lost. MapReduce provides a very simple way to write algorithms to process this data, and those algorithms scale fantastically well as your data sets get bigger. Analysing billions of data items very quickly and very cheaply is why people are going mad over Hadoop!
How does it work?
I won't explain the internals here; if you want to understand those then I suggest you start with the Google white papers, and this book is great on the architecture of HDFS. Instead I will go into how you can apply this technology to solving problems.
Have you used the Linux file system? That's it, that is how easy HDFS is! Once you have it set up and redundancy configured, you are good to go. You can copy files from a local file system into HDFS and from there you can do what you want. HDFS manages all the blocks (it chops files up into blocks), it ensures redundancy, and if you lose a node in your cluster it will move blocks around to keep enough copies, migrating your data as you install new nodes. That's it!
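To make the block idea concrete, here is a toy Python sketch of the principle, not the real HDFS implementation: it chops a file's bytes into fixed-size blocks and assigns each block to several nodes for redundancy. The block size, node names and round-robin placement here are made up for illustration; real HDFS uses much larger blocks (64 MB by default in early versions) and a NameNode to track placement.

```python
# Toy illustration of HDFS-style storage: chop a file into fixed-size
# blocks and replicate each block across several nodes. Purely a sketch;
# real HDFS uses 64 MB+ blocks and a NameNode to track placement.

def split_into_blocks(data: bytes, block_size: int):
    """Chop the raw bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (round robin)."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"x" * 1000                     # stand-in for a huge file
blocks = split_into_blocks(data, 256)  # stand-in for a 64 MB block size
nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(blocks, nodes)
print(len(blocks), "blocks")           # 4 blocks
print(placement[0])                    # ['node1', 'node2', 'node3']
```

If `node2` died, a system like this would simply copy block 0 from `node1` or `node3` to another node to restore the replication count, which is exactly the kind of self-healing HDFS does for you.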
MapReduce allows you to process your data, which thanks to HDFS is already spread over your whole cluster, using all the CPU cycles across that cluster to crunch it. MapReduce is a new paradigm for processing data and it takes a little while to get used to. In subsequent articles I will go into depth on patterns you can use to process data (mostly in Pig and Hive, as writing jobs in Java is tedious), but for now I will give a general example.
Let's say you have a file stored on HDFS that contains a huge list of words, with lots of repetition, in random order. In fact let's say it's not just huge, it's HUGE: 100 gigabytes of data! How would you normally process this in code? Load the file, loop, increment a counter for each word, wait forever, then get an answer. I think we can do a bit better than that. So how do we do it in a MapReduce world? Let's go through the steps.
- Load the file
- Map: for each word, emit a (word, 1) pair
  - Each word maps to a count; as we loop over the long list, every occurrence of a word is mapped to the number one
- Sort: group the pairs by word
  - In our map the word is the key, so each word is emitted to a specific reducer
  - In this way we split up our data set while ensuring each word is counted in exactly one place
- Reduce: count the words
  - Each reducer gets a number of word:1 tuples
  - We can just loop over these and sum the counts
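The steps above can be sketched in plain Python. This is only a single-process simulation of what Hadoop does across a cluster (the function names `mapper`, `shuffle` and `reducer` are mine, not a Hadoop API), but the three phases mirror the list exactly.

```python
from collections import defaultdict

# Single-process sketch of the MapReduce word count above. On a real
# cluster the map calls, the shuffle and the reduce calls each run in
# parallel across many machines; the phases are the same.

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort phase: group values by key, so each unique word
    # ends up at exactly one reducer.
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reducer(word, counts):
    # Reduce phase: sum the ones for this word.
    return (word, sum(counts))

lines = ["the cat sat", "the cat ran", "the dog sat"]
pairs = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(w, c) for w, c in shuffle(pairs).items())
print(result)  # {'the': 3, 'cat': 2, 'sat': 2, 'ran': 1, 'dog': 1}
```

Notice that each call to `reducer` touches only one word's counts, which is why, on a real cluster, every reduce can run on a different machine at the same time.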
Although this is a very simple use of MapReduce, the key principles are in place. By mapping our data we essentially organise it by key; as keys in a map are unique, they can be processed separately, and finally we reduce each simple data set in a loop. In reality most MapReduce jobs contain many steps to achieve a goal, so things get split down into smaller and smaller tasks, then often joined back up to answer questions. In this sense MapReduce jobs are intrinsically parallelisable without you worrying about threads, locks, etc. Even in our trivial example, every reduce step could run in parallel: one per unique word, and British English has around 170,000 of them. I for one would not want to try to code that by hand!
That, in essence, is the power of Hadoop: massive storage and massive processing made very simple. As time goes on things of course get more complex; jobs become complicated and make use of new languages that express things better than Java can. But once you have got your head around the word count above, you are well on your way.
Over the coming weeks and months I'll start blogging in depth about how to set up, run and code for Hadoop, along with some of its related technologies. In the meantime I suggest you take a look at the original word count source code to understand how the above algorithm translates into real code.