I was poking around the net looking for tools and techniques to handle the data for a web scraper I’m building in Python, and I found myself checking out Udacity’s course, Introduction to Hadoop and MapReduce.
Udacity’s attractive UI and surprisingly short video lectures were persuasive enough to make me drop my work in the middle of the night and start learning more about Hadoop and MapReduce.
I already had a fair idea of what Hadoop does, but it’s a short course and I’m always eager to get a proper technical introduction to a new technology.
So far, in the two lessons I have taken, I learnt the very basics of Hadoop and its various pieces, such as:
1. HDFS (Hadoop Distributed File System),
2. The 3 V’s that the Hadoop framework needs to consider: Volume, Variety and Velocity (of data),
3. The concept of Data Nodes and Name Nodes,
4. What kinds of failures can put your project at risk,
5. Some of the open source libraries built around Hadoop and what they do, like Pig, Hive (turns SQL-like queries into MapReduce jobs), Mahout (an ML library for Hadoop) and Impala,
6. What kinds of problems it’s feasible to use Hadoop for,
7. How Hadoop replicates data across nodes so that it doesn’t get lost,
8. The functions of the Job Tracker and the Task Tracker.
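To make the replication point (item 7) a bit more concrete, here is a toy sketch in Python. It is not how HDFS is implemented: the function names and node names are made up for illustration, and the round-robin placement is a simplification (real HDFS uses a rack-aware policy). The only real detail is the replication factor of 3, which is the HDFS default.

```python
REPLICATION = 3  # HDFS's default replication factor


def place_blocks(blocks, data_nodes, replication=REPLICATION):
    """Toy name node: map each block to `replication` distinct data nodes."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    placement = {}
    for i, block in enumerate(blocks):
        # Simplified round-robin placement (an assumption for this sketch).
        placement[block] = [
            data_nodes[(i + k) % len(data_nodes)] for k in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {
        block
        for block, nodes in placement.items()
        if any(node not in failed for node in nodes)
    }


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical data nodes
    placement = place_blocks(["blk_a", "blk_b", "blk_c"], nodes)
    # Two data nodes die, yet every block is still readable somewhere.
    print(sorted(readable_blocks(placement, failed_nodes=["dn2", "dn3"])))
```

With three copies of each block, losing two of the four nodes still leaves every block readable. With a single copy, one dead node is enough to lose data.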
Once the data has been loaded into HDFS (using $ hadoop fs -put filename), the main command to start processing it is:
$ hadoop jar …jar_path… -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py -input input_dir -output output_dir
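For a rough idea of what mapper.py and reducer.py could contain, here is a minimal word-count sketch. It is my own illustration and not the course's code: the word-count task, the function names and the sample lines are all assumptions. It relies on the streaming convention that mappers emit tab-separated key/value lines and that the reducer sees them sorted by key.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, like a streaming mapper."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts per word; assumes records arrive sorted by key."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data"]
    # sorted() stands in for Hadoop's shuffle-and-sort step between the stages.
    for record in reducer(sorted(mapper(sample))):
        print(record)
```

In a real streaming job the two functions would live in separate files, each reading from sys.stdin and printing its records to standard output. They are combined here only so the whole pipeline can be tried locally.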
A peculiar thing about Hadoop is that the output directory must not already exist when you start a job. You have to specify a new directory for the job to begin successfully.
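One simple way to live with this rule is to give every run its own output directory name. The helper below is a small sketch of that idea; the function name and the timestamp format are my own choices, not anything Hadoop requires.

```python
from datetime import datetime


def fresh_output_dir(base, now=None):
    """Return `base` plus a timestamp suffix so each run gets a new name."""
    now = now or datetime.now()
    return f"{base}_{now:%Y%m%d_%H%M%S}"


if __name__ == "__main__":
    # Pass the printed name as the -output argument of the job.
    print(fresh_output_dir("output_dir"))
```

The optional `now` argument is only there to make the function easy to test with a fixed time.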
Also, what Cloudera does is package the Hadoop framework with some of the essential open source libraries and distribute that to users. The distribution is also open source, and Cloudera provides all kinds of support and services around Apache Hadoop.
Doug Cutting, the co-creator of Hadoop, works at Cloudera. It was fascinating to learn that Hadoop is named after his son’s toy elephant.
The first two lessons only had quizzes, but I’m eager to work on the projects that I will encounter in the next 2 lessons. I plan to do this course passively, not putting too much time into it. But this will still give me some concrete technical knowledge on how to implement Hadoop. Let’s see where it takes me.
P.S. I wrote this post to document my progress while I was bored in the metro, so I apologize if any reader was looking for something informative.