Hadoop Distributed File System – An Overview

Hadoop Distributed File System (HDFS) is the file system on which Hadoop stores its data. It is the underlying technology that stores data in a distributed manner across the cluster. It gives applications fast access to data for mining and analysis, and its built-in replication protects the data saved on HDFS against corruption and loss.

HDFS is typically used for storing and reading very large files. These large files can be a continuous stream of data from a web server, vehicle GPS data, or even the pulse data of a patient. HDFS stores such files across the nodes of a cluster in a distributed manner by decomposing each large file into small blocks (128 MB by default), which are spread across the nodes. Because of this, the processing of a large file can also be split across multiple nodes / servers, with each server processing a small portion of the file in parallel. HDFS also makes copies of each block on other nodes, so if a block on one server is corrupted, HDFS can quickly recover it from a replica block on another node with minimal risk of data loss.
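The block decomposition described above can be sketched in a few lines of Python. This is purely illustrative (the function name is invented, not part of any Hadoop API), but it shows how a file's size maps onto fixed-size blocks using the 128 MB default:

```python
# Illustrative sketch of HDFS block decomposition (not a real Hadoop API).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default block size

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the (offset, length) byte ranges of the blocks a file occupies."""
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        length = min(block_size, file_size_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))                      # 3
print(blocks[-1][1] // (1024 * 1024))   # 44
```

Each of these blocks, rather than the whole file, is the unit that gets placed, replicated, and processed on individual nodes.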


In a generic HDFS architecture there is a Name node and a set of Data nodes. The Name node keeps track of the small blocks that each file is split into: it maintains the address mapping that identifies which node and which block to read each file chunk from. The Name node also maintains an edit log for audit and recovery purposes. The Data nodes are where the actual file blocks are stored; whenever a file is requested, the Data nodes return the file content. The Data nodes talk to each other and stay in continuous sync, replicating file blocks among themselves in real time.

To read a file in HDFS, a message is sent to the Name node, and the Name node replies with the Data node and block information. The client application can then connect to those specific Data nodes, and the Data nodes return the file blocks being requested. Client libraries in Python, Java, and other programming languages can do this job.
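The read path above can be simulated with a toy model. The class and method names here are invented for illustration and are not the real Hadoop client API; the point is the two-step protocol: ask the Name node for block locations, then fetch each block from a Data node:

```python
# Toy simulation of the HDFS read path (illustrative only).

class NameNode:
    def __init__(self):
        # filename -> list of (block_id, [hosts holding a replica])
        self.block_map = {}

    def get_block_locations(self, filename):
        return self.block_map[filename]

class DataNode:
    def __init__(self, host):
        self.host = host
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

# Wire up a tiny cluster: one Name node, two Data nodes.
nn = NameNode()
dn1, dn2 = DataNode("dn1"), DataNode("dn2")
nodes = {"dn1": dn1, "dn2": dn2}
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_1"] = b"hello "   # replica of block 1 on another node
dn2.blocks["blk_2"] = b"world"
nn.block_map["/logs/app.log"] = [("blk_1", ["dn1", "dn2"]),
                                 ("blk_2", ["dn2"])]

def read_file(filename):
    # Step 1: ask the Name node which blocks make up the file and where they live.
    # Step 2: fetch each block from the first available Data node and concatenate.
    data = b""
    for block_id, hosts in nn.get_block_locations(filename):
        data += nodes[hosts[0]].read_block(block_id)
    return data

print(read_file("/logs/app.log"))  # b'hello world'
```

Note that the file bytes never pass through the Name node; it only serves metadata, while the bulk data flows directly between the client and the Data nodes.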

To write a file to HDFS, a message is first sent to the Name node to create a new entry for that file. The client application then sends the data to a single Data node, and that Data node replicates the blocks to the other Data nodes in real time. Once the file is stored, the Data node sends an acknowledgment back through the client application to the Name node, which updates its records of the file's blocks and the Data nodes they are stored on.
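The write path can be sketched the same way. Again the names are invented for illustration: the client hands a block to the first Data node, which forwards it down a replication pipeline, and the Name node records the final placement:

```python
# Toy sketch of the HDFS write path (illustrative only, not the real API).

class DataNode:
    def __init__(self, host):
        self.host = host
        self.blocks = {}  # block_id -> bytes

    def write_block(self, block_id, data, pipeline):
        self.blocks[block_id] = data
        if pipeline:  # forward the block to the next replica in the pipeline
            pipeline[0].write_block(block_id, data, pipeline[1:])

class NameNode:
    def __init__(self):
        self.block_map = {}  # filename -> list of (block_id, [hosts])

    def record_block(self, filename, block_id, targets):
        self.block_map.setdefault(filename, []).append(
            (block_id, [t.host for t in targets]))

nn = NameNode()
dn1, dn2, dn3 = DataNode("dn1"), DataNode("dn2"), DataNode("dn3")

# Client writes one block to the first Data node; it replicates to the rest.
targets = [dn1, dn2, dn3]
targets[0].write_block("blk_1", b"sensor data", targets[1:])
nn.record_block("/data/readings", "blk_1", targets)

# All three replicas now hold the block, and the Name node knows where.
print(all("blk_1" in dn.blocks for dn in targets))  # True
```

The pipeline pattern means the client only uploads each block once; the Data nodes themselves fan the replicas out across the cluster.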

The Name node is a critical component of the HDFS architecture. Because the edit log records the metadata changes made to the file system, we can use it to rebuild a failed Name node. We can also run a secondary Name node that keeps a merged copy of the edit log.
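The rebuild-from-edit-log idea can be shown with a minimal sketch. Real HDFS edit logs are binary and record far richer operations; here, as a simplifying assumption, each entry is just an (operation, path) tuple, and replaying them reconstructs the namespace:

```python
# Illustrative sketch: rebuilding Name node metadata by replaying an edit
# log. Real HDFS edit logs are binary and richer; this toy version only
# tracks which paths exist.

def replay_edit_log(entries):
    namespace = set()
    for op, path in entries:
        if op == "CREATE":
            namespace.add(path)
        elif op == "DELETE":
            namespace.discard(path)
    return namespace

edit_log = [("CREATE", "/a"), ("CREATE", "/b"),
            ("DELETE", "/a"), ("CREATE", "/c")]
print(sorted(replay_edit_log(edit_log)))  # ['/b', '/c']
```

Replaying a long log from scratch is slow, which is why a secondary Name node periodically merges the log into a checkpoint that a new Name node can start from.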

HDFS Federation allows a cluster to have multiple namespace volumes, each managed by its own Name node. Supporting multiple namespaces in the cluster improves scalability and isolation.

HDFS can be accessed through UI tools such as Apache Ambari, through command line interfaces, and through client libraries for Java, Python, PHP, etc.

HDFS command line example:

The HDFS command line examples here assume that you have a Hadoop cluster set up on Linux and that you are connected to a node using PuTTY.

All HDFS commands use the prefix hadoop fs -, and below are some common examples:

  1. List the files on the hadoop cluster : hadoop fs -ls
  2. Create a new directory : hadoop fs -mkdir hadoop-trigent
  3. Copy a file from local file system to HDFS : hadoop fs -copyFromLocal <<filename>> <<HDFS filename>>
  4. Copy a file from HDFS to the local file system : hadoop fs -copyToLocal <<HDFS filename>> <<Local filename>>
  5. Remove a file : hadoop fs -rm <<filename>>
  6. Remove a directory : hadoop fs -rmdir hadoop-trigent
  7. To see the commands available : hadoop fs

For more information on the HDFS command line, please refer to: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html


  • Anubhav Jha is experienced in web development and works with technologies such as PHP, MySQL, Linux, and Java. He specializes in geo-spatial databases, CRM design, and search algorithms. His technology interests span machine learning, NLP, analytics, and Big Data.