MapReduce – Distributing Your Processing Power

MapReduce (MR) is one of the core features of the Hadoop ecosystem and works together with YARN (Yet Another Resource Negotiator). It is built into Hadoop as an out-of-the-box way to distribute the processing of data across the nodes of a cluster. MR divides the data into partitions and processes those partitions in parallel. The Mapper transforms each input record into key-value pairs, and the Reducer aggregates the values collected for each key into the final result. MR also handles unforeseen problems, such as a Hadoop node shutting down or becoming slow, so that the job still completes effectively.

Understanding a problem and addressing it with a MapReduce solution:

The example below uses the data of a hypothetical department store that holds information about its customers.

Assuming this is a big store like Walmart or Tesco, we know that the data will be huge. The store management wants to know the number of customers in each employment category for some analysis. The old-school approach would be to load this data into a SQL database and run a SELECT COUNT query grouped on the 'employment' column. For a large data set, this will be a slow operation.


The problem can be converted into a MapReduce problem. The first step is to map the data to key-value pairs that give some insight. The key will be the employment type and the value will be a count. The Mapper reads each line of the data file and creates a key-value pair from it. The same key can appear many times. Based on the data above, the Mapper will convert the data into:

Management -> 1

Technician -> 1

Unknown -> 1

and so on for every record.

The Mapper output is then distributed across the cluster. After the mapping function, the key-value pairs are shuffled and sorted automatically, so that all values for the same key end up together. The Mapper output shown above becomes:

Blue collar -> 1

Management -> 1, 1

Now the Reducer reads each key and sums the values collected for it, so the data becomes:

Blue collar -> 1

Management -> 2

N.B.: We are focusing on only the first few rows of the data table shown above. In an actual implementation the counts will vary.

In a nutshell, the Mapper and Reducer have done the following:

Input data -> Mapping to key-value pairs (Mapper) -> Shuffle and Sort -> Processing the mapped data (Reducer)
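Before moving to a cluster, the flow above can be tried out in a single process. The following is a minimal Python sketch of the three stages, not how Hadoop implements them internally. The sample records, the column layout (employment in the second column) and the function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample records in the form "customer_id,employment".
RECORDS = [
    "1,Management",
    "2,Technician",
    "3,Blue collar",
    "4,Management",
    "5,Unknown",
]


def mapper(line):
    """Emit one (employment, 1) pair for a comma-delimited record."""
    fields = line.strip().split(",")
    return (fields[1], 1)


def shuffle_and_sort(pairs):
    """Group the values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Sum all counts collected for one key."""
    return (key, sum(values))


def run_job(lines):
    """Run map, shuffle/sort and reduce one after the other."""
    mapped = [mapper(line) for line in lines]
    grouped = shuffle_and_sort(mapped)
    return [reducer(key, values) for key, values in grouped]


for employment, count in run_job(RECORDS):
    print(f"{employment} -> {count}")
```

With these sample records, Management appears twice, so its two 1 values are grouped by the shuffle step and summed to 2 by the reducer.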

Now let us write a small PHP program that performs this Mapper and Reducer job. Hadoop is built on an open-source stack, and through Hadoop Streaming the Mapper and Reducer can be written in most programming languages, such as Python, Perl, Java and .NET.

Let us assume that the data file is on our server, is saved as CustomerData.txt and is comma-delimited.

Create a new directory on HDFS:

hadoop fs -mkdir customerData

Copy the data file to the HDFS directory:

hadoop fs -copyFromLocal CustomerData.txt customerData/CustomerData.txt


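#!/usr/bin/env php
<?php
// mapper.php: emit "<employment><TAB>1" for every input record
// iterate through lines
while (($line = fgets(STDIN)) !== false) {
    $line = trim($line);
    if ($line === '') continue; // skip blank lines
    $explodedArray = explode(",", $line);
    $employment = $explodedArray[1];
    printf("%s\t%d\n", $employment, 1);
}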




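#!/usr/bin/env php
<?php
// reducer.php: sum the counts emitted by the mapper for each employment type
$employmentCountArray = array();
while (($line = fgets(STDIN)) !== false) {
    // split line into key and count
    list($employment, $count) = explode("\t", trim($line));
    if (!isset($employmentCountArray[$employment])) {
        $employmentCountArray[$employment] = 0;
    }
    $employmentCountArray[$employment] += (int) $count;
}
// print the total for each employment type
foreach ($employmentCountArray as $employment => $count) {
    echo $employment . "->" . $count . "\n";
}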

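On a Hadoop cluster, where both scripts are executable and available to the job, this will be executed as:

hadoop jar <<path to >>hadoop-streaming-<version>.jar \
  -mapper "mapper.php" \
  -reducer "reducer.php" \
  -input "customerData/CustomerData.txt" \
  -output "customerData/CustomerCount.txt"

Hadoop treats the -output path as a directory and writes the result there in part files. The output can be viewed using the hadoop fs -cat command, for example hadoop fs -cat customerData/CustomerCount.txt/part-*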


Hadoop Distributed File System – An Overview

Hadoop Distributed File System (HDFS) is the file system on which Hadoop stores its data. It is the underlying technology that stores data in a distributed manner across the cluster. It gives applications fast access to the data for mining and analysis, and it protects the data saved on it against corruption.

HDFS is usually used for storing and reading large files. These files can be a continuous stream of data from a web server, GPS data from a vehicle or even the pulse data of a patient. HDFS stores such files across multiple nodes in a distributed manner by decomposing each large file into blocks (128 MB by default). The blocks are spread across the nodes in the cluster, so processing can also be split: each server processes its own portion of the large file in parallel. HDFS keeps copies of every block on other nodes, so if a block on one server is corrupt or the server fails, HDFS can quickly restore it from a replica with minimal data loss.
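The block arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the node names are made up, the replication factor of 3 is assumed as the usual HDFS default, and the round-robin placement is a simplification, since real HDFS placement also takes racks into account.

```python
BLOCK_SIZE_MB = 128  # default block size mentioned above


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block a file would be cut into."""
    if file_size_mb <= 0:
        return []
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds whatever is left
    return sizes


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified on purpose: real HDFS placement is rack-aware.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    copies = min(replication, len(nodes))
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + i) % len(nodes)] for i in range(copies)]
    return placement


blocks = split_into_blocks(300)  # a hypothetical 300 MB file
print(blocks)
print(place_replicas(len(blocks), ["node1", "node2", "node3", "node4"]))
```

A 300 MB file becomes two full 128 MB blocks and one 44 MB block, and every block ends up on three different nodes, so losing any single node leaves two copies of each block.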


In a generic HDFS architecture there is a Name node and a number of Data nodes. The Name node keeps track of the blocks that each file is split into and of the Data nodes that hold them, so it knows which node and which block to read each file chunk from. The Name node also maintains an edit log that records every change to this metadata. The Data nodes are where the actual file blocks are stored. Whenever a file is requested, the Data nodes return the file content. The Data nodes also talk to each other to replicate file blocks as they are written.

To read a file in HDFS, the client sends a message to the Name node, and the Name node replies with the Data nodes and the block information. The client application then connects to those specific Data nodes, which return the requested file blocks. There are client libraries in Python, Java and other programming languages that do this job.

To write a file to HDFS, the client first sends a message to the Name node to create a new entry for that file. The client application then sends the data to a single Data node, and that Data node replicates it to other Data nodes in real time. Once the file is stored, the Data node sends an acknowledgment back through the client application to the Name node, so that the Name node records the file's blocks and the Data nodes they are stored on.
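The read and write paths described above can be modelled with a toy in-memory version. This is a sketch under stated assumptions rather than the HDFS client API: the class names, the tiny 4-byte block size and the replication factor of 2 are all invented to keep the example short.

```python
class DataNode:
    """Stores the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes


class NameNode:
    """Keeps only metadata: which blocks make up a file and where they live."""

    def __init__(self):
        self.files = {}  # path -> list of (block_id, [data node names])


def write_file(namenode, datanodes, path, data, block_size=4, replication=2):
    """Split data into blocks, store each on several nodes, record the metadata."""
    entries = []
    for index, start in enumerate(range(0, len(data), block_size)):
        block_id = f"{path}#blk{index}"
        targets = [datanodes[(index + i) % len(datanodes)] for i in range(replication)]
        for node in targets:  # the first node receives the block, the rest hold replicas
            node.blocks[block_id] = data[start:start + block_size]
        entries.append((block_id, [node.name for node in targets]))
    namenode.files[path] = entries


def read_file(namenode, datanodes, path):
    """Ask the name node for block locations, then fetch each block from a data node."""
    by_name = {node.name: node for node in datanodes}
    content = b""
    for block_id, locations in namenode.files[path]:
        for name in locations:
            node = by_name[name]
            if block_id in node.blocks:  # fall back to a replica if a copy is missing
                content += node.blocks[block_id]
                break
        else:
            raise IOError(f"all replicas of {block_id} are missing")
    return content


namenode = NameNode()
nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
write_file(namenode, nodes, "/customerData/CustomerData.txt", b"1,Management\n")
print(read_file(namenode, nodes, "/customerData/CustomerData.txt"))
```

Note that the name node object never holds file content, only the block list and locations, which is why a read still succeeds from a replica when one data node loses its copy.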

The Name node is a critical component of the HDFS architecture. If it is lost, it can be rebuilt from its edit log, which holds the metadata changes needed to create a new Name node. We can also run a secondary Name node, which keeps a merged copy of the edit log.

HDFS Federation allows a cluster to have more than one namespace volume. It enables support for multiple namespaces in the cluster to improve scalability and isolation.

HDFS can be accessed through UI tools such as Apache Ambari, through the command line interface, and through interface libraries for languages such as Java, Python and PHP.

HDFS command line example:

The HDFS command line example here assumes that you have a Hadoop cluster set up on Linux and that you are connected to a node using PuTTY.

All the HDFS commands use the prefix hadoop fs -. Below are some common examples:

  1. List the files on the Hadoop cluster: hadoop fs -ls
  2. Create a new directory: hadoop fs -mkdir hadoop-trigent
  3. Copy a file from the local file system to HDFS: hadoop fs -copyFromLocal <<filename>> <<HDFS filename>>
  4. Copy a file from HDFS to the local file system: hadoop fs -copyToLocal <<HDFS filename>> <<local filename>>
  5. Remove a file: hadoop fs -rm <<filename>>
  6. Remove a directory: hadoop fs -rmdir hadoop-trigent
  7. See the commands available: hadoop fs


Introducing Apache Ambari

Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides RESTful APIs and a web-based management interface.

Ambari started as a sub-project of Hadoop but has now become a fully fledged tool used by many Hadoop developers and administrators. At the time of writing, the latest version available is Ambari 2.5.1.

System requirements for Ambari:

As of now, Ambari can be installed only on UNIX-based servers. You also need the following add-ons and packages:

  • Xcode
  • JDK 7.0 (a previous version of Ambari can be compiled using JDK 6.0). Future versions of Ambari (above 3.0) will require JDK 8.0
  • Python and Python setuptools
  • rpmbuild
  • gcc-c++ package
  • nodejs

Running the Ambari server:

Download the tarball and install it on the server.

Run the command ambari-server setup. This initializes the Ambari server.

To start, stop, restart or check the status of the Ambari server, use the following command:

ambari-server start|stop|restart|status

To log in to the Ambari server, open the URL http://<<your-ambari-server>>:8080/. The default username and password are both admin.

Changing the default port 8080 on the Ambari server:

To change the port of the Ambari server, open the following file:


Search for the line that starts with client.api.port. It will look like this:

client.api.port = 8080

Comment out that line, so that the original value is kept as a backup in case Ambari does not work on the new port, and add a line with the new port:

#client.api.port = 8080

client.api.port = 8090

Save the file and then restart the Ambari server:

sudo ambari-server stop

sudo ambari-server start
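If the same port change has to be made on several servers, the edit can be scripted. The following is a minimal Python sketch that works on the text of a properties file. The function name is hypothetical and the sketch leaves reading and writing the actual file to the caller, so it makes no assumption about where the file lives.

```python
def change_port(properties_text, new_port, key="client.api.port"):
    """Comment out the existing `key` line and add one with the new port."""
    output = []
    replaced = False
    for line in properties_text.splitlines():
        name = line.split("=", 1)[0].strip()
        if name == key and not replaced:
            output.append("#" + line)             # keep the old value as a backup
            output.append(f"{key} = {new_port}")  # the value Ambari will now use
            replaced = True
        else:
            output.append(line)
    if not replaced:
        output.append(f"{key} = {new_port}")      # key was absent, so add it
    return "\n".join(output) + "\n"


original = "client.api.port = 8080\n"
print(change_port(original, 8090), end="")
```

Lines that are already commented out are left untouched, because their name starts with # and therefore does not match the key.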

If you have a web server such as Apache running, you can also use ProxyPass and ProxyPassReverse to forward all requests on a particular URL to the Ambari port.

Deploying the Ambari agent on the cluster nodes:

  • Download and install the Ambari agent rpm on each node of the cluster.
  • Edit the file /etc/ambari-agent/conf/ambari-agent.ini and update the IP address or hostname of the Ambari server in it.
  • Then start the Ambari agent using the command ambari-agent start.

Ports used by Ambari:

Ambari uses the following default ports:

  • 8080 – for the Ambari web interface.
  • 8440 – for the connection from Ambari agents to the Ambari server.
  • 8441 – for registration and heartbeats from Ambari agents to the Ambari server.

When an Ambari host does not connect to the Ambari server, there are some basic checks that you can perform:

  • If there is a firewall between the Ambari host and the server, check that ports 8440 and 8441 are allowed through it.
  • Check iptables for rules pertaining to the Ambari ports.
  • Temporarily disable SELinux on both server and client and check whether the host is able to connect to the server.
  • Check the log at /var/log/ambari-agent/ambari-agent.log for error messages.
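The first of these checks can be automated from the agent host. The sketch below uses only the Python standard library to test whether a TCP connection to the two agent-facing ports succeeds. The function names are hypothetical, and a successful connection only shows that the port is reachable, not that the agent is configured correctly.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


def check_ambari_ports(server, ports=(8440, 8441)):
    """Map each agent-facing port to whether it is reachable from this host."""
    return {port: port_open(server, port) for port in ports}
```

Calling check_ambari_ports with the host name of your Ambari server returns a dictionary such as {8440: True, 8441: False}, which points at the port that a firewall or iptables rule is blocking.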


Machine Learning and Neural Networks using PHP

Machine learning is a type of artificial intelligence (AI) that gives computers the ability to learn without being explicitly programmed. It focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data.

In machine learning, knowledge is gained by learning the patterns and structure of the data. The effect is that the algorithm, or the way the data is handled, changes itself based on the type of data it encounters in real time.

A loose analogy is the automatic scaling of EC2 instances in AWS, which increases or decreases resources based on the load without any human intervention.

An Artificial Neural Network is a machine learning technique, modelled on the human brain, that lets machines recognise patterns the way human beings do. Consider a set of images of handwritten digits.

The images all show the same digit '9' but in different shapes. So how do we identify that each shape is a 9 (nine)? We can because we have been trained that a 9 has a curve at the top followed by a curve or straight line to the bottom. A neural network is trained on examples in the same way.

An ideal use case for a neural network algorithm is finding, in a list of thousands of images, the ones that match one particular image.

Implementation of a Neural Network in PHP

Some people assume that PHP is mainly a web language that is not capable of sophisticated tasks. But as the language has grown, it has become able to handle complex tasks and has many out-of-the-box features.

PHP has a binding for the FANN (Fast Artificial Neural Network) library, which includes a framework for easy handling of training data sets. It is easy to use, versatile, well documented, and fast. The PHP website states that it supports both fully connected networks and sparse networks.

A sparse network is a network in which each node is connected to specific nodes only, i.e. it has far fewer connections than are possible, and it is more difficult to create. An example of a sparse network is a social friendship website, where each member is connected to only some of the other members.

A fully connected network has every node connected to every other node. A family is a fully connected network, where each member is connected to every other member.
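The difference between the two kinds of network can be expressed as a density: the share of possible connections that actually exist. The following Python sketch follows the family and friendship analogy used above. The names and links are invented for illustration, and it models a plain graph rather than FANN's layered networks.

```python
def max_connections(node_count):
    """Number of connections in a fully connected network of node_count nodes."""
    return node_count * (node_count - 1) // 2


def density(node_count, connections):
    """Fraction of possible connections that exist (1.0 means fully connected)."""
    possible = max_connections(node_count)
    return len(connections) / possible if possible else 0.0


# A family: every member is connected to every other member.
family = ["mother", "father", "son", "daughter"]
family_links = {(a, b) for i, a in enumerate(family) for b in family[i + 1:]}

# A friendship site with 4 members: each knows only one or two of the others.
friend_links = {("ann", "bob"), ("bob", "cat"), ("cat", "dan")}

print(density(len(family), family_links))  # 1.0
print(density(4, friend_links))            # 0.5
```

Four nodes allow at most six connections, so the family with all six is fully connected, while the friendship network with three of them is sparse.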

Installing the PHP FANN module requires PHP 5.2.0 or above and libfann 2.1.0 or above.

To install it, you first have to install the FANN development package:

sudo yum install fann-devel

Then download php-fann from GitHub or the PECL website and install it. The installation is similar to the Gearman installation described in Using Gearman with PHP.

It is always good to reload your phpinfo page to check that the module is installed and that there are no errors.

There is a well-documented example on the website to test this module.

The example shows how to train a network on data for the XOR function. After the network has been trained on the four XOR input and output pairs, it produces the output for any of those inputs from the trained data.
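FANN reads its training data from a plain-text file: a header line with the number of training pairs, inputs and outputs, followed by alternating lines of inputs and expected outputs. As a hedged illustration of that layout, the Python sketch below renders the four XOR pairs in this format, using -1 for false and 1 for true. The function name is hypothetical and the sketch only builds the text, it does not train anything.

```python
# The four XOR pairs: (inputs, expected outputs), with -1 = false and 1 = true.
XOR_PAIRS = [
    ((-1, -1), (-1,)),
    ((-1, 1), (1,)),
    ((1, -1), (1,)),
    ((1, 1), (-1,)),
]


def to_fann_training_data(pairs):
    """Render (inputs, outputs) pairs as text in the FANN training file layout."""
    num_inputs = len(pairs[0][0])
    num_outputs = len(pairs[0][1])
    lines = [f"{len(pairs)} {num_inputs} {num_outputs}"]  # header line
    for inputs, outputs in pairs:
        lines.append(" ".join(str(value) for value in inputs))
        lines.append(" ".join(str(value) for value in outputs))
    return "\n".join(lines) + "\n"


print(to_fann_training_data(XOR_PAIRS), end="")
```

The first line of the result is 4 2 1: four training pairs, each with two inputs and one output.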

Another simple example of data training would be to have the database increment a counter whenever specific keywords are introduced.

FANN can be combined with Gearman so that this training keeps happening in the background in real time, although there is no documentation that confirms this approach.

There is also the php-ml library for implementing artificial intelligence and neural networks in PHP, but it requires PHP 7.0 as a minimum.


Using Gearman with PHP

Gearman is an application framework, available to PHP as an extension, that distributes workload to different processes or machines. Its name is an anagram of "manager", and like a good manager it delegates tasks efficiently. Gearman is open source and is designed to hand appropriate computing tasks to multiple computers, so that large tasks can be done more quickly. It was originally developed by Brad Fitzpatrick and was initially written for Perl-based applications.

Benefits of Gearman

  • Gearman is licensed under the Berkeley Software Distribution (BSD) license, so it is easy to get the latest builds and add one's own functionality to it.
  • As it supports multiple languages, it can be used to create heterogeneous applications. For example, we can use PHP and Java in the same project, where Java calls a function written in PHP.
  • We can implement parallel processing using Gearman.
  • Gearman can support messages up to 4 GB in size and can chunk messages to keep the overhead minimal.

Gearman Architecture

  • The Gearman architecture consists of three major components:
    • the Gearman server,
    • a Gearman client, and
    • the Gearman worker.

The client sends a request to the server, the server passes it to a worker, and the worker's result travels back to the client the same way. Gearman has client and worker APIs that your programs use to communicate with the Gearman job server, so you do not have to configure the networking or the mapping of jobs. Behind the scenes, the Gearman client and worker APIs communicate with the job server over TCP sockets.
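The three roles can be modelled in a single process to see how a job travels. The Python sketch below is a toy stand-in for the job server, not the Gearman protocol or its API: the class and method names are invented, and a real server talks to clients and workers over TCP rather than through direct function calls.

```python
from collections import deque


class JobServer:
    """Toy stand-in for the job server: queues jobs and routes them to workers."""

    def __init__(self):
        self.workers = {}     # function name -> callable registered by a worker
        self.queue = deque()  # pending (function name, workload) jobs

    def register(self, name, func):
        """Called by a worker to announce which function it can run."""
        self.workers[name] = func

    def submit(self, name, workload):
        """Called by a client to hand a job to the server."""
        self.queue.append((name, workload))

    def run_next(self):
        """Dispatch the oldest job to its worker and return the result."""
        name, workload = self.queue.popleft()
        if name not in self.workers:
            raise LookupError(f"no worker registered for {name!r}")
        return self.workers[name](workload)


def job_function(workload):
    """Worker side: return the length of the lower-cased workload."""
    return len(workload.lower())


server = JobServer()
server.register("job", job_function)                             # worker
server.submit("job", "How many words are there in this string")  # client
print(server.run_next())  # 39
```

The client and the worker never reference each other. Both only know the function name "job", which is what lets them run as separate processes, on separate machines or in different languages.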

Installing Gearman

  • To start with, Gearman should be installed and PHP should be configured to use it.
  • Check whether Gearman is available to PHP.

Next, download Gearman.

This downloads a gzipped archive, which we have to extract:

Configure Gearman:

The above steps are followed by make and make install.

To check that Gearman is installed properly, use the following command:

Next, we configure PHP with Gearman support.

The downloaded Gearman extension file is only a few KB in size, so the download is fast. The next step is to configure and install it with the PHP installation on the server:

  • tar -xzvf gearman-0.6.0.tgz
  • cd gearman-0.6.0
  • phpize
  • ./configure
  • make
  • sudo make install

After this, you can go to the modules directory of the PHP installation and see that the extension file is listed there. Open your php.ini file and add the Gearman extension to it. Alternatively, go to the php.d directory and create a gearman.ini file that loads the Gearman shared object.

  • cd /etc/php.d
  • touch gearman.ini

In the gearman.ini write the following content.

; Enable gearman extension module

extension =

Testing Gearman on the website

Once the above step is completed, restart the Apache service and load the phpinfo page in your browser to see whether Gearman support is enabled.

Testing the Gearman installation on the local machine

Create a PHP file with the following code:

<?php
echo "The Gearman version installed is " . gearman_version() . "\n";

Once you run this file, you should get output similar to the following:

The Gearman version installed is 0.6.0

Creating the worker

The worker is the file which will read and reply to all incoming requests.

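<?php
// worker.php: register a function and wait for jobs
$worker = new GearmanWorker();
$worker->addServer(); // defaults to 127.0.0.1:4730
$worker->addFunction("job", "job_function");
while ($worker->work());

function job_function($job)
{
    return strlen(strtolower($job->workload()));
}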

Once the worker is created, let it run as a background job in Linux:

php worker.php&

We can check if the worker is running by using the following command

ps -elf | grep worker.php

Now create the client for this worker

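<?php
$client = new GearmanClient();
$client->addServer(); // defaults to 127.0.0.1:4730
// do() is named doNormal() in newer versions of the extension
print $client->do("job", "How many words are there in this string");
print "\n";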

Save the file as client.php.

Now you can run this client by calling php client.php. You should see the output 39, the number of characters in the string.


Natural Language Programming (NLP) – A must for eCommerce sites

The success of a mobile or web application depends largely on the user experience and its value to end users. By and large, industry experts agree that for a product to be adopted by end users, the app or website must completely capture their attention.

With e-commerce dramatically replacing brick-and-mortar showrooms, the competition to hold viewer attention has taken on gargantuan proportions. Users often leave web pages within 10-20 seconds, and that is all the time there is to convert a visitor into a customer. Also, with the internet making it far easier to access reviews, most users check product reviews before committing to a purchase. Reviews therefore help to convert visitors into customers, and their importance should not be undervalued. It is for this reason that NLP (Natural Language Programming) becomes really important. NLP is formally described as a precise formal description of some procedure that its author created. It is human readable, and it can also be read by a suitable software agent.

When NLP is incorporated in an eCommerce site, it can help with quick decision making, as the customer gets clear, precise reviews, such as a 'yes', a 'no' or a 'maybe'. Since it 'speaks' the language of the user, it could lead to far more visits, thereby improving Google ranking.
