An Introduction to Apache Kafka

What is Kafka?

Kafka is an open-source distributed streaming platform from the Apache Software Foundation, used to build real-time data pipelines. At its core, it is a publish-subscribe messaging system.

Kafka can automatically rebalance consumers and replicates data across brokers, which enhances reliability. It offers high, stable throughput for producing and consuming data, even at large data volumes. Because Kafka is a distributed system, it scales easily and quickly. Kafka relies on the zero-copy principle, using the OS kernel to transfer data directly, and persists messages in a distributed commit log, which makes it durable. Its high throughput, built-in partitioning, replication, and fault tolerance make it a good fit for large-scale message processing applications.


Kafka was originally developed at LinkedIn and was open-sourced in 2011.

Kafka has the following capabilities:

  • It can be used to publish and subscribe to streams of records, like an enterprise messaging system, but unlike JMS it handles very high speed and volume.
  • It can store streams of records in fault-tolerant, durable storage.
  • It can process streams of records as they occur, while they are still in the pipeline.

Kafka use cases:

  • Complex event processing (e.g. as part of an IoT system)
  • Building a real-time data platform for event streaming
  • Building intelligent applications for fraud detection, cross-selling, and predictive maintenance
  • Real-time analytics (user activity tracking) and stream processing
  • Ingesting data into Spark or Hadoop (both real-time and batch pipelines) and log aggregation
  • Building real-time streaming ETL pipelines

Kafka can work with Spark Streaming, Flume, Storm, HBase, Flink, and Spark for real-time ingesting, analysis, and processing of streaming data.


Terms:

Kafka stores data as records, each consisting of a key, a value, and a timestamp, which arrive from many producers. Records are partitioned and stored in different partitions within different topics. Each partition is an ordered, immutable sequence of records. Each record in a partition is assigned a sequential ID called the offset, which uniquely identifies it within the partition.
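The record/partition/offset relationship can be sketched with a toy model in Python (this is an illustration of the concepts, not Kafka's actual implementation):

```python
import time
from dataclasses import dataclass, field

@dataclass
class Record:
    """A Kafka-style record: key, value, and timestamp."""
    key: str
    value: str
    timestamp: float = field(default_factory=time.time)

class Partition:
    """Toy partition: an append-only, ordered log of records."""
    def __init__(self):
        self.log = []

    def append(self, record: Record) -> int:
        # The offset is the record's position in the log; it is
        # sequential and unique within this partition.
        offset = len(self.log)
        self.log.append(record)
        return offset

p = Partition()
first = p.append(Record("user-1", "login"))
second = p.append(Record("user-2", "click"))
print(first, second)  # offsets 0 and 1
```

Note that offsets are only meaningful per partition; two records in different partitions can share the same offset.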

Adding another dimension, a consumer group can contain one or more consumers, which read messages from the topic's partitions.

A Kafka cluster runs as one or more Kafka brokers (also called servers or nodes), and partitions can be distributed across the cluster's nodes.

Distribution:

Kafka partitions are distributed over the Kafka cluster. Each partition has one leader broker, and the other brokers holding a replica act as followers. The leader handles all requests, reads, and writes for the partition, while the followers passively replicate the data from the leader, so load is balanced across the Kafka cluster. If the leader broker fails, one of the followers is elected as the new leader. The replication factor is configurable per topic.
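Leader failover can be sketched as follows. This is a deliberately simplified model; in real Kafka the controller elects the new leader from the in-sync replica (ISR) set:

```python
def elect_leader(replicas, failed):
    """Pick a leader: the first replica whose broker is still alive.
    `replicas` is an ordered list of broker IDs hosting the partition."""
    for broker_id in replicas:
        if broker_id not in failed:
            return broker_id
    raise RuntimeError("no live replica: partition is offline")

replicas = [1, 2, 3]                      # broker 1 is the preferred leader
leader = elect_leader(replicas, failed=set())
new_leader = elect_leader(replicas, failed={1})  # broker 1 fails; a follower takes over
print(leader, new_leader)
```

With a replication factor of 3, the partition stays available as long as at least one of the three replica brokers is alive.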

The Kafka cluster manages its brokers with the help of a connected ZooKeeper ensemble, which provides coordination services for the distributed system over the network.

Kafka Cluster Architecture:

As an example, consider a topic configured with three partitions. Each replica's ID is the same as the ID of the broker that hosts it.

Producers:

Producers publish data to topics of their choice, and they are responsible for choosing which partition within the topic each record goes to. A producer sends data as records, each containing a key-value pair, and converts them to byte arrays with the help of a key serializer and a value serializer. By default, the partitioner chooses the partition number by hashing the record key; records without a key can be distributed in a round-robin fashion. Producers support various approaches for sending data to the server.
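The default partitioning behavior can be sketched like this (Kafka's real default partitioner uses a murmur2 hash; `crc32` stands in here purely for illustration):

```python
import zlib
from itertools import count
from typing import Optional

_round_robin = count()  # shared counter for keyless records

def choose_partition(key: Optional[bytes], num_partitions: int) -> int:
    """Toy partitioner: hash the key when present, otherwise round-robin."""
    if key is not None:
        # Same key -> same hash -> same partition, preserving per-key ordering.
        return zlib.crc32(key) % num_partitions
    # No key: spread records evenly across partitions.
    return next(_round_robin) % num_partitions

# Records with the same key always land in the same partition.
same = choose_partition(b"user-42", 3) == choose_partition(b"user-42", 3)
print(same)
```

Keeping all records for a key in one partition is what lets Kafka guarantee ordering per key, though not across the whole topic.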

Consumers:

Consumers read and process data from topics within the Kafka cluster. Each consumer is labeled with a consumer group name; multiple consumers sharing the same group name form a consumer group. The Kafka cluster delivers each record from a topic to a single consumer instance within each subscribing consumer group. If the consumer instances all have different group names, then records are delivered to all of them. Each consumer instance can run in a different process or on a different machine.
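Delivery within a group follows from how partitions are assigned to its members: each partition is owned by exactly one consumer in the group. A rough sketch (real Kafka uses pluggable assignors such as range or round-robin, coordinated by the group coordinator):

```python
from collections import defaultdict

def assign_partitions(partitions, consumers):
    """Toy round-robin assignment: each partition is owned by exactly
    one consumer in the group, so each record reaches one member only."""
    assignment = defaultdict(list)
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return dict(assignment)

group = assign_partitions(["p0", "p1", "p2", "p3"], ["c1", "c2"])
print(group)  # {'c1': ['p0', 'p2'], 'c2': ['p1', 'p3']}
```

This also shows why running more consumers in a group than there are partitions leaves some consumers idle: there are no partitions left to assign to them.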

Conclusion:

Kafka provides a highly scalable, well-abstracted solution for distributed systems and many kinds of real-time processing. Apache Kafka sits within the architectures of several leading applications, such as Twitter, LinkedIn, Netflix, Uber, Yelp, and eBay.

In this blog, I have covered some basic information, use cases, and terms. In my next blog, I will write in detail about the Kafka producer, partitioner, serializer, deserializer, and consumer group.
