Saturday, 18 October 2014

Tutorial for MapReduce Programming

Introduction
Hadoop MapReduce can be defined as a software programming framework used to process large volumes of data (at the terabyte scale) in parallel on a cluster of nodes. The cluster consists of thousands of nodes of commodity hardware, and the processing is distributed, reliable, and fault tolerant. A typical MapReduce job proceeds in the following steps:

The input data is split into independent chunks, which Map tasks process in parallel, emitting key-value pairs.
The output of the Map phase is sorted by key.
The sorted output becomes the input to the Reduce phase, which aggregates it, produces the final output, and returns the result to the client.
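The three steps above can be sketched in plain Java as a stand-alone simulation of a word count (this is not the Hadoop API; the class name, input text, and helper methods are illustrative assumptions):

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceSteps {
    // Step 1: "map" an input chunk into (word, 1) key-value pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        return Arrays.stream(chunk.toLowerCase().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1))
                     .collect(Collectors.toList());
    }

    // Step 2: sort and group the map output by key (the shuffle/sort phase).
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Step 3: "reduce" each key's values to the final result (sum the counts).
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> out = new LinkedHashMap<>();
        grouped.forEach((k, v) -> out.put(k, v.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    public static void main(String[] args) {
        List<String> chunks = List.of("the quick brown fox", "the lazy dog");
        List<Map.Entry<String, Integer>> mapped = chunks.stream()
                .flatMap(c -> map(c).stream())
                .collect(Collectors.toList());
        System.out.println(reduce(shuffle(mapped)));
    }
}
```

On a real cluster each `map` call would run on a different node and the shuffle would move data across the network; here all three phases run in one process purely to show the data flow.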
MapReduce Framework
The Apache Hadoop MapReduce framework is written in Java and uses a master-slave configuration. The master is known as the JobTracker and the slaves are known as TaskTrackers. The master controls the tasks processed on the slaves (which are simply the nodes in the cluster), and the computation is done on the slaves, so the compute and storage nodes are the same in a clustered environment. The guiding principle is to 'move the computation to the nodes where the data is stored', which makes processing faster.

MapReduce Processing
The MapReduce framework model is very lightweight, so the hardware cost is low compared with other frameworks. At the same time, we should understand that the model works efficiently only in a distributed environment, since processing is done on the nodes where the data resides. Its other features, such as scalability, reliability, and fault tolerance, also work best in a distributed environment.

MapReduce Implementation
Now it is time to discuss the implementation of the MapReduce model using the Java programming platform. The following are the different components of the entire end-to-end implementation.

The client program, i.e. the driver class, which initiates the process.
The Map function, which performs the split using key-value pairs.
The Reduce function, which aggregates the processed data and sends the output back to the client.
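The roles of these three components can be mirrored in a small stand-alone sketch. In a real Hadoop job they would be implemented against the framework's own Mapper, Reducer, and Job classes; the interfaces and class names below are illustrative assumptions, not the Hadoop API:

```java
import java.util.*;

public class WordCountDriver {
    // The Map function: splits a line of input and emits (word, 1) pairs.
    interface Mapper { void map(String line, Map<String, List<Integer>> emit); }

    // The Reduce function: aggregates all values emitted for one key.
    interface Reducer { int reduce(String key, List<Integer> values); }

    // The driver: initiates the job, feeds input to the mapper, hands each
    // grouped key to the reducer, and collects the result for the client.
    static Map<String, Integer> run(List<String> input) {
        Mapper mapper = (line, emit) -> {
            for (String w : line.toLowerCase().split("\\s+"))
                if (!w.isEmpty())
                    emit.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
        };
        Reducer reducer = (key, values) ->
                values.stream().mapToInt(Integer::intValue).sum();

        Map<String, List<Integer>> intermediate = new TreeMap<>();
        for (String line : input)
            mapper.map(line, intermediate);

        Map<String, Integer> result = new LinkedHashMap<>();
        intermediate.forEach((k, v) -> result.put(k, reducer.reduce(k, v)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("to be or not", "to be")));
    }
}
```

The design point is the separation of concerns: the driver knows nothing about words or counts, only how to wire the Map and Reduce functions together, which is exactly why the same framework can run any job that supplies its own pair of functions.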
