
getting on with hadoop

So yesterday I was given a little farm of servers (only 8 nodes) to do whatever I want with. Lately I’ve been working on computer vision and machine learning algorithms that take not only a lot of computation time but a lot of space as well (one of the datasets is 100 GB, the other 200 GB), so I’m getting into distributed computing more and more. Namely, Hadoop and Mahout.

All 8 servers were running Debian etch, so first things first: I decided to upgrade them to squeeze. As I don’t really trust dist-upgrade that much, I decided not to skip lenny, so the upgrade chain was etch -> lenny -> squeeze.

Although I spent some time googling for a utility that could help me do the upgrade on all 8 nodes at once, I couldn’t find any (if you know of one, please let me know for next time). I could have written a script to do the job for me, but there were just waaaay too many questions from the dist-upgrade script each time, especially when upgrading from etch to lenny. To be honest, the whole upgrade procedure went rather smoothly: I never had to go down to the console to restart a machine, it always came back after a reboot. \o/
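For reference, here is a minimal sketch of what such a script could look like. It is not what was actually used for this upgrade. The host names node1 to node8 and the root login are made up for illustration, and the apt options simply tell dpkg to keep existing config files instead of asking, which may not be the right answer for every question the etch to lenny step raises. By default it only prints the commands it would run.

```python
import shlex
import subprocess

# Hypothetical host names; the real node names are not given in this post.
NODES = [f"node{i}" for i in range(1, 9)]

# Assumption: answering every prompt with "keep the current config" is acceptable.
UPGRADE = (
    "DEBIAN_FRONTEND=noninteractive apt-get -y "
    "-o Dpkg::Options::=--force-confdef "
    "-o Dpkg::Options::=--force-confold dist-upgrade"
)


def ssh_command(host, remote_cmd, user="root"):
    """Build the argument list for running remote_cmd on host over ssh."""
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", remote_cmd]


def run_everywhere(hosts, remote_cmd, dry_run=True):
    """Run remote_cmd on each host in turn and collect the exit codes.

    With dry_run=True the commands are only printed, nothing is executed.
    """
    results = {}
    for host in hosts:
        cmd = ssh_command(host, remote_cmd)
        if dry_run:
            print(shlex.join(cmd))
            results[host] = None
        else:
            results[host] = subprocess.run(cmd).returncode
    return results
```

Calling run_everywhere(NODES, UPGRADE) prints one ssh line per node; passing dry_run=False would actually execute them one node at a time, so a broken upgrade on the first node can be caught before the rest are touched.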

So finally Hadoop came into focus. I went with 0.21.0. It has quite a lot of silly bugs, like the start and stop scripts that won’t work unless you define the HADOOP_HOME environment variable. Other than that, I followed Hadoop’s short documentation on how to set up a cluster. I ran into some trouble with permissions, but now the nodes are finally up and running.
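The HADOOP_HOME workaround can be wrapped in a few lines. This is only a sketch: the install path /opt/hadoop-0.21.0 is a made-up example, and the script name is passed in by the caller rather than assumed.

```python
import os
import subprocess


def hadoop_env(hadoop_home, base_env=None):
    """Return a copy of the environment with HADOOP_HOME set if it is missing."""
    env = dict(os.environ if base_env is None else base_env)
    env.setdefault("HADOOP_HOME", hadoop_home)
    return env


def run_hadoop_script(script_name, hadoop_home="/opt/hadoop-0.21.0"):
    """Run a script from <hadoop_home>/bin with HADOOP_HOME defined.

    hadoop_home is a hypothetical default; point it at the real install.
    """
    env = hadoop_env(hadoop_home)
    script = os.path.join(env["HADOOP_HOME"], "bin", script_name)
    return subprocess.run([script], env=env).returncode
```

An already exported HADOOP_HOME wins over the default, so the wrapper does not silently redirect an existing setup.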

Now it’ll be Mahout’s turn. I’ve been running it on my localhost for quite some time, as I was using some of the algorithms implemented in it, but I’m really excited to see what it’ll do on a real cluster.

Mahout has some great algorithms implemented in it, but, for instance, it does not have an SVM classifier yet. There have been several attempts to implement one, like MAHOUT-227, MAHOUT-232 and MAHOUT-334, but none of them is in a state where it could be included in the HEAD of the repository (see my patches for MAHOUT-232).

This motivated me to start implementing an SVM classifier for Mahout from scratch. It’s still on its way, and as you can see I was distracted by some little obstacles, but now it’s time to write up the code and try to push it to HEAD. Let’s see what happens…
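To give an idea of what "an SVM classifier from scratch" boils down to, here is a tiny linear SVM trained with Pegasos-style stochastic subgradient descent on the hinge loss. It is a single-machine toy in plain Python, not the Mahout implementation and not any of the MAHOUT patches. The function names and parameters are invented for this sketch, and the bias term is left out, so it assumes the classes can be separated by a hyperplane through the origin (append a constant feature if they cannot).

```python
import random


def dot(w, x):
    """Inner product of two equal-length sequences."""
    return sum(wj * xj for wj, xj in zip(w, x))


def train_linear_svm(samples, labels, lam=0.01, epochs=200, seed=0):
    """Pegasos-style SGD for a linear SVM without bias; labels are +1 or -1."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    order = list(range(len(samples)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            x, y = samples[i], labels[i]
            margin = y * dot(w, x)  # computed before the update
            # Subgradient step on the L2 regularizer: shrink the weights.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Subgradient step on the hinge loss, only if the margin is violated.
            if margin < 1.0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w


def predict(w, x):
    """Classify x as +1 or -1 according to the sign of w . x."""
    return 1 if dot(w, x) >= 0 else -1
```

A quick check on four points: train_linear_svm([[2, 2], [3, 3], [-2, -2], [-3, -3]], [1, 1, -1, -1]) gives a weight vector that classifies all four correctly. The interesting part for a cluster is how to spread these updates over many machines, which this sketch does not attempt.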

My other big interest is bringing computer vision algorithms to Hadoop, as I’m working with heaps of images and most of the algorithms could be parallelized. Some people have implemented one or two algorithms in the MapReduce framework, but it would be great to see a framework like Mahout (or like OpenCV and ITK on the CV side) for Hadoop.
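As a toy illustration of why many image algorithms fit this model, here is an intensity histogram over a pile of images, split into a map step (one image at a time, independent of all the others) and a reduce step (summing the partial counts). This is plain Python imitating the two phases, not the Hadoop API. The images are just lists of grayscale values, and all the names are made up for the sketch.

```python
from collections import defaultdict


def map_image(image_id, pixels, bins=4, max_value=256):
    """Map step: emit (bin, count) pairs for one grayscale image.

    image_id plays the role of the input key and is not used in the result.
    """
    width = max_value // bins
    counts = defaultdict(int)
    for p in pixels:
        counts[min(p // width, bins - 1)] += 1
    return list(counts.items())


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each histogram bin."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def histogram(images):
    """Run the map step per image, then reduce all emitted pairs."""
    emitted = []
    for image_id, pixels in images.items():
        emitted.extend(map_image(image_id, pixels))
    return reduce_counts(emitted)
```

Because map_image never looks at more than one image, the map calls could run on different nodes; only the small (bin, count) pairs have to travel to the reducer.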
