Hadoop – Handling Big Data

Apache Hadoop is all about handling Big Data, especially unstructured data. It streamlines data for distributed processing across clusters of computers.

Activities on Big Data:

  • Store – Big Data needs to be collected in a repository, and it is not necessary to store it in a single physical database.
  • Process – Processing becomes more tedious in terms of cleansing, calculating, transforming and running algorithms.
  • Access – The data makes no business sense if it cannot be searched and retrieved, and it must be virtually showcased along business lines.

Hadoop Distributed File System (HDFS):

Hadoop stores large files, in the range of gigabytes to terabytes, across multiple machines. HDFS provides data awareness between the task tracker and the job tracker: the job tracker schedules jobs to the task trackers with knowledge of the data location.

The two main aspects of Hadoop are the data processing framework and HDFS. HDFS is a rack-aware file system designed to handle data effectively. It uses a single-writer, multiple-reader model and supports operations such as reading, writing and deleting files, and creating and deleting directories.
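The following is a minimal sketch of these operations using the Java FileSystem API. The Namenode address, directory and file names are illustrative assumptions, not part of any particular cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasicOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical Namenode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/demo");            // hypothetical directory
        Path file = new Path("/demo/hello.txt"); // hypothetical file

        fs.mkdirs(dir);                          // create a directory

        // Single writer: open an output stream, write, close.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Multiple readers may open the same file concurrently.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.delete(dir, true);                    // delete the directory recursively
        fs.close();
    }
}
```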

HDFS Architecture

Elements of HDFS Architecture:

  • Namenode

    Namenode is commodity hardware that contains the GNU/Linux operating system and the Namenode software; the software can run on ordinary commodity machines. The system hosting the Namenode acts as the master server. It manages the file system namespace and regulates clients' access to files. It is also responsible for executing file system operations such as renaming, closing, and opening files and directories.

  • Datanode

    Datanode is commodity hardware having the GNU/Linux operating system and the Datanode software. For every node in a cluster, there is a Datanode. It performs read-write operations on the file system as per the client's request, and it handles operations such as block creation, deletion and replication based on the instructions of the Namenode.

  • Block

    User data is stored in the files of HDFS. These files are divided into one or more segments, which are stored in individual Datanodes. These file segments are called blocks; a block is the minimum amount of data that HDFS can read or write. The block size is 64 MB by default and can be increased as per requirements in the HDFS configuration.
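As a quick illustration, the sketch below reads the block size the cluster would apply to new files; the class name is hypothetical, and the dfs.blocksize property mentioned in the comments is where the default is usually changed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeInfo {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath if present.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Block size that would be used for a new file written under "/".
        long blockSize = fs.getDefaultBlockSize(new Path("/"));
        System.out.println("Default block size: " + blockSize / (1024 * 1024) + " MB");

        // The cluster-wide default is normally set with the dfs.blocksize property
        // in hdfs-site.xml (dfs.block.size in older releases).
        fs.close();
    }
}
```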

Data Processing Framework & MapReduce:

The data processing framework is the tool used to process the data. It is a Java-based system called MapReduce. The MapReduce algorithm contains two tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, in which individual elements are broken down into <key, value> pairs.

A MapReduce program executes in three stages: the Map stage, the Shuffle stage and the Reduce stage.

  • Map stage – The mapper's job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper line by line. The mapper then processes the data and creates several chunks of data.
  • Reduce stage – This stage is the combination of the Shuffle stage and the Reduce stage. The reducer's job is to process the data that comes from the mapper. It produces a new set of output, which is then stored in HDFS.
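The classic word-count program is a minimal sketch of these stages using the standard org.apache.hadoop.mapreduce API: the mapper emits a <word, 1> pair for every token, and after the shuffle the reducer sums the counts for each word. Input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map stage: each input line is tokenized into <word, 1> pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce stage: values for the same word arrive together after the shuffle and are summed.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a JAR, it would typically be run with something like `hadoop jar wordcount.jar WordCount /input /output`, where both paths are HDFS directories.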

Hadoop Workflow

Benefits of Hadoop:

  • Hadoop is open source, and because it is Java-based it is compatible with all platforms.
  • It provides a cost-effective storage solution for businesses. It helps to easily access data sources and results in much faster data processing.
  • It is a highly scalable storage platform, as it can store and distribute large data sets across hundreds of servers that operate in parallel.
  • A key advantage of using Hadoop is its fault tolerance. When data is sent to an individual node, that data is also replicated on other nodes in the cluster, which means that in the event of a failure, there is another copy available for use (the sketch after this list shows how to inspect a file's replication factor).
  • It is widely used across industries such as finance, media, entertainment, government, healthcare, retail and so forth.
  • It provides great data reliability: because blocks are replicated, data is stored and delivered without loss even when individual nodes fail.
  • It is secured and authenticated. HBase security, HDFS and MapReduce allow only approved users to operate on secured data, thereby protecting the entire system from illegal access.
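A small sketch of the replication point above, using the Java FileSystem API; the file path and the target factor of three are illustrative assumptions (the cluster-wide default comes from dfs.replication in hdfs-site.xml).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationInfo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/demo/hello.txt"); // hypothetical file path

        // Number of copies HDFS keeps of this file's blocks.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication factor: " + status.getReplication());

        // Ask the Namenode to keep three copies of this file's blocks.
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}
```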

Author

  • Deepika M S

    Deepika works as a Software Test Engineer with Trigent Software. She has over five years of IT industry experience in testing web-based and mobile applications using both manual and automation testing. Deepika is also experienced in identifying test scenarios and designing effective test cases, and is well versed in SDLC/Agile and Scrum methodologies. She has been involved in developing automated test scripts for new features, analyzing results, and reporting on test results.