MapReduce – Distributing Your Processing Power

MapReduce (MR) is one of the core components of the Hadoop ecosystem and works together with YARN (Yet Another Resource Negotiator). It is an out-of-the-box solution built into Hadoop for distributing the processing of data across the nodes of a cluster. MR splits the data into multiple partitions and processes each partition in parallel. The Mapper transforms each record of the input into a key-value pair, and the Reducer then aggregates the values for each key into a result we can understand. MR also has features to handle runtime problems, such as a Hadoop node shutting down or becoming slow, to ensure the job still completes effectively.

Understanding a Problem and Addressing It with a MapReduce Solution:

The example given below uses the customer data of a hypothetical departmental store.
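The original data table is not reproduced here, but since the file is comma-delimited and the Mapper shown later reads the employment category from the second column, a few hypothetical rows might look like this:

1,management
2,technician
3,entrepreneur
4,blue-collar
5,unknown
6,management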

Assuming this store is as big as Walmart or Tesco, we know that the data will be huge. The store management wants a count of customers per employment category for some analysis. The old-school approach would be to load this data into a SQL database and run a query such as SELECT employment, COUNT(*) FROM customers GROUP BY employment. For a large data set, this is a slow operation.


The problem can be converted into a MapReduce problem. The first step is to map the data to key-value pairs that give some insight: the key will be the employment category and the value will be a count of 1. The Mapper reads each line of the data file and emits one such key-value pair; the same key can appear many times. Based on the data above, the Mapper will convert the data into:

Management -> 1
Technician -> 1
Entrepreneur -> 1
Blue-collar -> 1
Unknown -> 1
Management -> 1

...and so on.

This Mapper output is spread across the cluster. After the mapping phase, the key-value pairs are shuffled and sorted automatically, so the Mapper output shown above becomes:

Blue-collar -> 1,
Entrepreneur -> 1,
Management -> 1, 1,
Technician -> 1,
Unknown -> 1

Now the Reducer will read each key and sum its values.

So the data becomes:

Blue-collar -> 1,
Entrepreneur -> 1,
Management -> 2,
Technician -> 1,
Unknown -> 1

N.B.: We are only looking at the first few rows of the data shown above. In an actual run the counts will vary.

To put it in a nutshell, the Mapper and Reducer have done the following:

Input data -> Map to key-value pairs (Mapper) -> Shuffle and sort -> Process the mapped data (Reducer)
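Before writing the actual Hadoop scripts, it may help to see this whole flow in one piece. The following is a minimal sketch in plain PHP that simulates the three phases in a single process, using a handful of hypothetical records; it is a conceptual model only, not how Hadoop actually executes the job.

<?php
// A minimal local simulation of the MapReduce flow (conceptual only;
// Hadoop runs these phases across many nodes, not in one process).

// Hypothetical sample records: customer id, employment category.
$lines = [
    "1,management", "2,technician", "3,entrepreneur",
    "4,blue-collar", "5,unknown", "6,management",
];

// Map phase: emit an (employment, 1) pair for every record.
$pairs = [];
foreach ($lines as $line) {
    $fields = explode(",", trim($line));
    $pairs[] = [$fields[1], 1];
}

// Shuffle-and-sort phase: group the values by key, ordered by key.
$grouped = [];
foreach ($pairs as $pair) {
    $grouped[$pair[0]][] = $pair[1];
}
ksort($grouped);

// Reduce phase: sum the grouped values for each key.
foreach ($grouped as $employment => $counts) {
    echo $employment . " -> " . array_sum($counts) . "\n";
}
?>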

Now let us write a small PHP program that performs this Mapper and Reducer job. The Hadoop ecosystem is built on an open-source stack, and through Hadoop Streaming it can work with many programming languages, such as Python, Perl, Java, .NET and others.

Let us assume that our data file is saved on the server as CustomerData.txt and that the data is comma-delimited.

Create a new directory in HDFS:

hadoop fs -mkdir customerData

Copy the data file into the HDFS directory:

hadoop fs -copyFromLocal CustomerData.txt customerData/CustomerData.txt
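Optionally, verify that the file is in place:

hadoop fs -ls customerData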

Mapper.php

<?php
// Read the input line by line from standard input.
while ($line = fgets(STDIN)) {
    $line = trim($line);
    // Split the comma-delimited record into fields.
    $explodedArray = explode(",", $line);
    // The employment category is in the second column.
    $employment = $explodedArray[1];
    // Emit a tab-separated key-value pair: employment \t 1
    printf("%s\t%d\n", $employment, 1);
}
?>
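Since Hadoop Streaming communicates with the script purely through standard input and output, the Mapper can be sanity-checked locally before deploying it (assuming PHP is installed on the machine):

cat CustomerData.txt | php Mapper.php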

Reducer.php

<?php
// Accumulate the total count per employment category.
$employmentCountArray = array();

while ($line = fgets(STDIN)) {
    $line = trim($line);
    // Split the line into key and count on the tab character.
    list($employment, $count) = explode("\t", $line);
    if (!isset($employmentCountArray[$employment])) {
        $employmentCountArray[$employment] = $count;
    } else {
        $employmentCountArray[$employment] = $employmentCountArray[$employment] + $count;
    }
}

// Print the aggregated counts once all input has been read.
foreach ($employmentCountArray as $employment => $count) {
    echo $employment . "->" . $count . "\n";
}
?>
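The full job can likewise be simulated locally, with the Unix sort command standing in for Hadoop's shuffle-and-sort phase:

cat CustomerData.txt | php Mapper.php | sort | php Reducer.php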

On a Hadoop cluster this will be executed as:

hadoop jar <<path to >>hadoop-streaming-<version>.jar \
-file Mapper.php -mapper "php Mapper.php" \
-file Reducer.php -reducer "php Reducer.php" \
-input "customerData/CustomerData.txt" \
-output "customerData/CustomerCount"

Note that -output names a directory that must not already exist; Hadoop writes the results into part files inside it. The -file options ship the PHP scripts to the cluster nodes so that each node can run them.

The output can be viewed using the hadoop fs -cat command:

hadoop fs -cat customerData/CustomerCount/part-*
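Based on the hypothetical rows above, the output would look something like this (actual counts depend on the real data):

blue-collar->1
entrepreneur->1
management->2
technician->1
unknown->1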

For more information on MapReduce and its usage, refer to: https://hadoop.apache.org/docs/r2.8.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

Author

  • Anubhav Jha is experienced in web development and works with PHP, MySQL, Linux and Java. He specializes in geo-spatial databases, CRM design and search algorithms. His technology interests span machine learning, NLP, analytics and Big Data.