An Introduction to Apache Kafka

What is Kafka?

Kafka is an open-source distributed streaming platform from the Apache Software Foundation, used as a platform for real-time data pipelines. At its core, it is a publish-subscribe messaging system.

Kafka automatically balances consumers and replicates data, which improves reliability. It offers high throughput for both producing and consuming data and keeps performance stable even at high data volumes. Because Kafka is a distributed system, it can be scaled out quickly and easily. Kafka relies on the zero-copy principle, letting the OS kernel transfer the data directly, and it persists records in a distributed commit log, so it can be considered durable. High throughput, built-in partitioning, replication, and fault tolerance make it a good solution for large-scale message processing applications.

Kafka was originally developed at LinkedIn and was open-sourced in 2011.

Kafka has the following capabilities:

  • It can be used to publish and subscribe to streams of data like an enterprise messaging system, but it differs from JMS-based systems in the speed and volume it can handle.
  • Kafka can be used for storing streams of records in fault-tolerant storage.
  • It can be used for processing streams of records in the pipeline as they occur.

Kafka use cases:

  • Complex event processing (for example, as part of an IoT system),
  • Building a real-time data platform for event streaming,
  • Building intelligent applications for fraud detection, cross-selling, and predictive maintenance,
  • Real-time analytics (user activity tracking) and stream processing,
  • Ingesting data into Spark or Hadoop (both real-time pipelines and batch pipelines) and log aggregation,
  • Building a real-time streaming ETL pipeline.

Kafka can work with Spark Streaming, Flume, Storm, HBase, Flink, and Spark for real-time ingestion, analysis, and processing of streaming data.

Terms:

Kafka stores data as records, each consisting of a key, a value, and a timestamp, which come from many producers. Records are grouped into topics, and each topic is split into partitions where the records are stored. Each partition is an ordered, immutable sequence of records. Every record in a partition is assigned a sequential ID called the Offset, which uniquely identifies the record within that partition.
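To make the idea of offsets concrete, here is a minimal in-memory sketch of a single partition. It is not Kafka's storage code; the `Record` and `PartitionLog` names are made up for illustration, and the list index simply plays the role of the offset.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    """One immutable record: key, value, and timestamp (hypothetical model)."""
    key: Optional[bytes]
    value: bytes
    timestamp_ms: int


@dataclass
class PartitionLog:
    """Append-only sequence of records; the list index acts as the offset."""
    records: List[Record] = field(default_factory=list)

    def append(self, key: Optional[bytes], value: bytes) -> int:
        offset = len(self.records)  # next sequential ID within this partition
        self.records.append(Record(key, value, int(time.time() * 1000)))
        return offset

    def read(self, offset: int) -> Record:
        return self.records[offset]


log = PartitionLog()
first = log.append(b"user-1", b"login")
second = log.append(b"user-2", b"click")
print(first, second)            # offsets are handed out in append order
print(log.read(second).value)   # a record is looked up by its offset
```

Offsets are only meaningful inside one partition: two partitions of the same topic each start counting from zero.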

Adding another dimension, a Consumer Group consists of one or more consumers that together read the messages from the partitions of a topic.

A Kafka Cluster runs with one or more Kafka Brokers (also called servers or nodes), and partitions can be distributed across the cluster nodes.

Distribution:

Kafka partitions are distributed over the Kafka Cluster, and each server in the cluster handles its share of requests and data. Each partition has one Leader Broker, and the other brokers that hold a copy of the partition act as Follower Brokers. The Leader handles all read and write requests for the partition, while the Followers passively replicate the data from the Leader. Because the leaders of different partitions are spread across brokers, the load is well balanced within the cluster. If the Leader Broker fails, one of the Followers is elected as the new Leader. The number of copies kept of each partition, the Replication Factor, is configurable for each topic.
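The failover described above can be sketched with a toy model. This is a deliberate simplification, not Kafka's election logic: the `PartitionReplicas` class is hypothetical, and it simply promotes the first remaining follower, whereas a real cluster chooses the new leader from the replicas that are in sync.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PartitionReplicas:
    """Hypothetical model: the broker IDs that hold one partition."""
    leader: int
    followers: List[int]

    def fail_broker(self, broker_id: int) -> None:
        if broker_id in self.followers:
            # Losing a follower only reduces the number of copies.
            self.followers.remove(broker_id)
        elif broker_id == self.leader:
            if not self.followers:
                raise RuntimeError("no replica left to take over")
            # Promote the first remaining follower to leader.
            self.leader = self.followers.pop(0)


p0 = PartitionReplicas(leader=1, followers=[2, 3])  # replication factor 3
p0.fail_broker(1)
print(p0.leader, p0.followers)
```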

The Kafka Cluster manages its brokers with the help of a connected ZooKeeper server, which provides coordination services for distributed systems over the network.

Kafka Cluster Architecture:

The topics shown here are configured to use three partitions. The ID of each replica is the same as the ID of the broker that hosts it.

Producers:

Producers publish data to the appropriate topics. They are responsible for choosing the topic, and the partition within the topic, for each record. A producer sends data as records, and each record contains a key and value pair, so the producer converts the data to byte arrays with the help of a Key Serializer and a Value Serializer. By default, the partitioner chooses the partition number by hashing the key; when there is no key, partitions can be assigned in a round-robin fashion. The producer also offers several ways of sending data to the server, for example fire-and-forget, synchronous, and asynchronous sends.
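The following sketch illustrates the idea of serializing a key and mapping it to a partition. It is not the Kafka client's implementation: `serialize` and `choose_partition` are hypothetical helpers, and `zlib.crc32` stands in for the hash function that the real partitioner uses.

```python
import itertools
import zlib
from typing import Optional

NUM_PARTITIONS = 3                 # assumed partition count for the topic
_round_robin = itertools.count()   # counter used for records without a key


def serialize(text: str) -> bytes:
    """Stand-in for a key or value serializer: string to byte array."""
    return text.encode("utf-8")


def choose_partition(key: Optional[bytes], num_partitions: int = NUM_PARTITIONS) -> int:
    if key is None:
        # No key: spread records over the partitions in turn.
        return next(_round_robin) % num_partitions
    # Same key, same hash, same partition (crc32 is only a stand-in hash).
    return zlib.crc32(key) % num_partitions


key = serialize("user-42")
same_partition = choose_partition(key) == choose_partition(key)
spread = [choose_partition(None) for _ in range(4)]
print(same_partition, spread)
```

Because a given key always lands in the same partition, all records for that key keep their relative order.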

Consumers:

Consumers read and process the data from the appropriate topics within the Kafka cluster. Each consumer is labeled with a consumer group name, and the consumers that share the same group name form a consumer group. The Kafka cluster delivers each record from a topic to a single consumer instance within each consumer group. If every consumer instance has a different group name, then each record is delivered to all consumer instances. Each consumer instance can run in a different process or on a different machine.
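A small simulation may help here. It assumes, as a simplification, that each partition is owned by exactly one instance per group and that ownership is decided by a plain modulo; `receivers_for_partition` and the group names are made up for illustration.

```python
from typing import Dict, List


def receivers_for_partition(partition: int, groups: Dict[str, List[str]]) -> Dict[str, str]:
    """For each consumer group, pick the one instance that owns this partition."""
    return {
        group_name: instances[partition % len(instances)]
        for group_name, instances in groups.items()
    }


groups = {
    "billing": ["billing-1", "billing-2"],  # two instances share the work
    "audit": ["audit-1"],                   # a second, independent group
}
owners = receivers_for_partition(1, groups)
print(owners)
```

Every group sees the record once, but inside the "billing" group only one of the two instances receives it.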

Conclusion:

Kafka provides a highly scalable abstraction for distributed systems and for various kinds of real-time processing. Apache Kafka is part of the well-defined architectures of several leading companies such as Twitter, LinkedIn, Netflix, Uber, Yelp, and eBay.

I have, in this blog, covered some basic information, use cases, and terms. In my next blog, I will write in detail about Kafka Producer, Partitioner, Serializer, Deserializer and Consumer Group.

OAuth 2.0 – Part 2

Authorization Grant

Continuing from my earlier blog: OAuth can be deployed across different scenarios, clients, and authorization flows. I will now discuss OAuth 2.0’s four grant types:

  • Authorization Code
  • Implicit
  • Resource Owner Password Credentials
  • Client Credentials

Authorization Code

The authorization code flow is commonly used for server-side applications, where the source code is not publicly exposed and the client secret can therefore be kept private. This flow is based on redirection and supports the refresh_token flow.

This grant type is used where the API provider (such as Facebook, Google, or Twitter) allows third-party clients to access user information on behalf of users. A user can therefore use any number of client applications by granting each one specific authorizations and scopes.

  1. The client sends an authorization code request to the authorization server via the user-agent, with properties such as client_id, scope, redirect_uri, and response_type.
  2. The authorization server verifies the client details and, if they are valid, redirects to an authorization page that lists the requested authorizations. If the user is not logged in, the server first authenticates the user by redirecting to its login page (the resource owner’s credentials are never shared with the client).
  3. The authorization server redirects the user-agent back to the client with an authorization code, so the client receives the authorization code.
  4. The client requests an access_token from the authorization server with client_id, client_secret, grant_type, and the authorization code (other parameters may be required depending on the API server implementation). This call happens on the server side without passing through the user-agent.
  5. The authorization server validates the client details and the authorization grant and, if they are valid, issues an access_token (and optionally a refresh_token) to the client.
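The client's side of steps 1, 3, and 4 can be sketched with the standard library alone. The endpoint URL and helper names below are hypothetical, and the `state` parameter is an addition not covered in this blog (RFC 6749 recommends it to tie the callback to the original request). In the token request, the authorization code travels in the `code` parameter.

```python
from urllib.parse import parse_qs, urlencode, urlparse

AUTHORIZE_URL = "https://auth.example.com/authorize"  # hypothetical endpoint


def build_authorization_url(client_id: str, redirect_uri: str, scope: str, state: str) -> str:
    """Step 1: the URL the user-agent is sent to."""
    query = urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,
    })
    return f"{AUTHORIZE_URL}?{query}"


def extract_code(callback_url: str, expected_state: str) -> str:
    """Step 3: read the authorization code from the redirect back to the client."""
    params = parse_qs(urlparse(callback_url).query)
    if params.get("state", [None])[0] != expected_state:
        raise ValueError("state mismatch")
    return params["code"][0]


def build_token_request(code: str, client_id: str, client_secret: str, redirect_uri: str) -> str:
    """Step 4: form-encoded body posted server-side to the token endpoint."""
    return urlencode({
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": redirect_uri,
        "client_id": client_id,
        "client_secret": client_secret,
    })


redirect_uri = "https://client.example.com/callback"
url = build_authorization_url("my-client", redirect_uri, "profile", "xyz")
code = extract_code(redirect_uri + "?code=C1&state=xyz", "xyz")
body = build_token_request(code, "my-client", "s3cret", redirect_uri)
print(url)
print(body)
```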

Implicit

The implicit flow is commonly used for web and mobile applications where the application runs in a browser and client confidentiality cannot be guaranteed. It does not support using a refresh_token to obtain a new access_token.

It is similar to the authorization code flow, but it passes the access_token directly to the client, which means there is no intermediate authorization code.

This grant type is used where the API application wants to provide limited information to the client for a short period of time. The client application can use this information to identify users (for example, “log in with Facebook” or “log in with Google”).

  1. The client sends a user authorization request to the authorization server via the user-agent, with properties such as client_id, scope, redirect_uri, and response_type.
  2. The authorization server verifies the client details and, if they are valid, redirects to an authorization page that lists the requested authorizations. If the user is not logged in, the server first authenticates the user by redirecting to its login page.
  3. The authorization server redirects the user-agent back to the client with the access_token, so the client receives the access_token and can continue to access protected resources.
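In the implicit flow defined by RFC 6749, the access token comes back in the fragment part of the redirect URI. In practice that fragment is usually read by JavaScript in the browser; the sketch below only illustrates the parsing step, and `parse_implicit_redirect` and the example URL are hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def parse_implicit_redirect(redirect_url: str) -> dict:
    """Read the token parameters from the URL fragment (the part after '#')."""
    fragment = urlparse(redirect_url).fragment
    params = {name: values[0] for name, values in parse_qs(fragment).items()}
    if "access_token" not in params:
        raise ValueError("no access_token in redirect")
    return params


result = parse_implicit_redirect(
    "https://client.example.com/callback"
    "#access_token=abc123&token_type=bearer&expires_in=3600"
)
print(result["access_token"], result["expires_in"])
```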

Resource Owner Password Credentials

The Resource Owner Password Credentials flow is used for first-party native client applications (where the application is installed on the device) and where there is a trust relationship between the resource owner and the client.

The credentials can be used directly as an authorization grant to obtain an access token. In this flow, the client asks the resource owner for a username and password and sends them, along with the client credentials, to get an access token. This is acceptable because the client and the authorization server are controlled by the same party. Unlike the previous flows, it does not involve redirection.

The authorization server should take special care when enabling this grant type and only allow it when other flows are not viable.

  • The resource owner gives a username and password to the client application.
  • The client sends the user credentials and client credentials to the authorization server with properties such as client_id, client_secret, username, password, scope, and grant_type.
  • The authorization server responds to the client with an access_token and optionally a refresh_token; the client then uses the access_token to access protected resources.
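As a rough sketch, the request in the second bullet is a form-encoded body with `grant_type` set to `password`. The helper name and sample values are made up; the point of the example is that `urlencode` escapes special characters in the password, which hand-built strings tend to get wrong.

```python
from urllib.parse import parse_qs, urlencode


def build_password_grant_request(username: str, password: str,
                                 client_id: str, client_secret: str, scope: str) -> str:
    """Form-encoded body for a token request using the password grant."""
    return urlencode({
        "grant_type": "password",
        "username": username,
        "password": password,
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    })


password_body = build_password_grant_request(
    "alice", "p&ss=word", "my-client", "s3cret", "profile"
)
print(password_body)
```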

Client Credentials

The Client Credentials flow is used to perform non-user-related tasks, where the client application needs to access resources or call functions in the resource server that are not related to a specific resource owner.

Client credentials are used as an authorization grant typically when the client is acting on its own behalf. It has several drawbacks when used for general purposes.

  1. The client sends its client credentials to the authorization server with properties such as client_id, client_secret, and grant_type.
  2. The authorization server responds to the client with an access_token (RFC 6749 advises against issuing a refresh_token for this grant); the client then uses the access_token to access protected resources.
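One common way to send the client credentials in step 1 is HTTP Basic authentication, with the grant type in the form body. The sketch below assumes simple identifiers without special characters; `build_client_credentials_request` is a hypothetical helper, and the request is built but not sent.

```python
import base64
from typing import Dict, Optional, Tuple
from urllib.parse import urlencode


def build_client_credentials_request(client_id: str, client_secret: str,
                                     scope: Optional[str] = None) -> Tuple[Dict[str, str], str]:
    """Return (headers, body) for a token request using the client credentials grant."""
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    headers = {
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        "Content-Type": "application/x-www-form-urlencoded",
    }
    form = {"grant_type": "client_credentials"}
    if scope:
        form["scope"] = scope
    return headers, urlencode(form)


cc_headers, cc_body = build_client_credentials_request("my-client", "s3cret")
print(cc_headers["Authorization"])
print(cc_body)
```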

Refresh Token Flow

Refresh tokens are credentials used to obtain access tokens. They are issued to the client by the authorization server and are used to obtain a new access token when the current access token expires. Issuing a refresh token is optional and at the discretion of the authorization server. If the authorization server issues a refresh token, it is included when the access token is issued. Most of the grant types support the refresh token flow.

  1. The client requests an access token from the authorization server by authenticating with its client credentials and presenting an authorization grant.
  2. The authorization server validates the client details and the resource owner’s authorization. If they are valid, it issues an access token and a refresh token to the client.
  3. The client requests a protected resource by passing the access token to the resource server.
  4. The resource server validates the access token and, if it is valid, responds to the request.
  5. Steps 3 and 4 repeat until the access token expires. If the client knows the access token has expired, it skips to step 7; otherwise it makes another protected resource request.
  6. The resource server returns an invalid token error since the access token is invalid.
  7. The client makes a new access token request to the authorization server by authenticating and presenting the refresh token. The client authentication requirements are based on the client type and the authorization server’s policies.
  8. The authorization server validates the client authentication and the refresh token. If they are valid, it issues a new access token and optionally a new refresh token.
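The steps above can be sketched as a small in-memory simulation. Everything here is illustrative: the class names are hypothetical, time is a manually advanced counter, there is no network, and the resource server checks tokens by asking the authorization server directly.

```python
import secrets


class AuthorizationServer:
    def __init__(self, access_ttl: int):
        self.access_ttl = access_ttl
        self.now = 0                  # manual clock, advanced by the caller
        self.access_tokens = {}       # access token -> expiry time
        self.refresh_tokens = set()

    def _issue_access(self) -> str:
        token = secrets.token_urlsafe(16)
        self.access_tokens[token] = self.now + self.access_ttl
        return token

    def issue_tokens(self):
        """Steps 1-2: issue an access token together with a refresh token."""
        refresh = secrets.token_urlsafe(16)
        self.refresh_tokens.add(refresh)
        return self._issue_access(), refresh

    def refresh(self, refresh_token: str) -> str:
        """Steps 7-8: exchange a valid refresh token for a new access token."""
        if refresh_token not in self.refresh_tokens:
            raise PermissionError("invalid refresh token")
        return self._issue_access()

    def is_valid(self, access_token: str) -> bool:
        expiry = self.access_tokens.get(access_token)
        return expiry is not None and self.now < expiry


def fetch_resource(server: AuthorizationServer, access_token: str) -> str:
    """Steps 3-4 and 6: the resource server accepts or rejects the token."""
    if not server.is_valid(access_token):
        raise PermissionError("invalid_token")
    return "protected data"


class Client:
    def __init__(self, server: AuthorizationServer):
        self.server = server
        self.access_token, self.refresh_token = server.issue_tokens()

    def get_resource(self) -> str:
        try:
            return fetch_resource(self.server, self.access_token)
        except PermissionError:
            # Token rejected: refresh once, then retry the request.
            self.access_token = self.server.refresh(self.refresh_token)
            return fetch_resource(self.server, self.access_token)


server = AuthorizationServer(access_ttl=60)
client = Client(server)
old_token = client.access_token
before_expiry = client.get_resource()   # token still valid
server.now += 61                        # let the access token expire
after_expiry = client.get_resource()    # triggers the refresh path
print(before_expiry, after_expiry, client.access_token != old_token)
```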

Conclusion

OAuth 2.0 provides flexible flows, with different authorization grant types for different scenarios and use cases. Some flows are complex and difficult to understand. However, once you learn more about OAuth 2.0 terms such as Authentication, Authorization, Client Application, Resource Owner, Authorization Server, Resource Server, Authorization Code, Access Token, and Refresh Token, you will be able to choose a grant type based on your requirements and flows.

OAuth 2.0 – Part 1

Introduction

OAuth 2.0 (Open Authorization) is an authorization framework which enables websites or applications to obtain limited access to an HTTP service (such as Facebook, GitHub, or Google). It is commonly used as a way for users to authorize third-party websites or applications to access their information on other web services without sharing their credentials. It is designed specifically to work with HTTP. OAuth 2.0 provides authorization flows for web, desktop, and mobile applications. To summarize, OAuth is an authorization protocol rather than an authentication protocol.

Where is OAuth Required?

The internet has a large number of web services running on it, and as a result the need arises for certain web services to access information held by other web services. However, each web service has its own user authentication credentials, which makes this difficult.

The idea, therefore, is to let the user authorize, for example, web service ‘A’ to access information from web service ‘B’. To elaborate, Google services such as Google Calendar allow developers to access information from the calendar once the user has granted authorization. OAuth is the standard authorization protocol that web service developers have settled on for accessing user information with authorization. OAuth 1.0 used cryptographic signatures for added security, whereas OAuth 2.0 dropped the signatures in favor of SSL/TLS. Thus, by using OAuth we can obtain authorization to get user information or perform certain functions on the user’s behalf from third-party services.

Roles

  • Resource owner – An entity capable of granting access to protected resources. When the resource owner is a person, it is referred to as the end-user.
  • Resource server – The server which holds the protected resources. It is capable of accepting and responding to protected resource requests using access tokens.
  • Client – An application or website that makes requests for protected resources on behalf of the resource owner and with its authorization. It has a Client Identifier, a publicly exposed string that the service API uses to identify the client application, and a Client Secret, which authenticates the identity of the application to the service API when the application requests access to a user’s account and must be kept private between the application and the API.
  • Authorization server – The server that issues an access_token to the client after successfully authenticating the resource owner and obtaining authorization.

Abstract Protocol Flow

  1. The client requests authorization from the resource owner. The authorization request can be made directly to the resource owner (as shown above) or indirectly via the authorization server.
  2. The client receives an authorization grant from the resource owner, which is a credential representing the resource owner’s authorization.
  3. The client requests an access token by authenticating with the authorization server and presenting the authorization grant.
  4. The authorization server validates the client details and the resource owner’s authorization and, if they are valid, issues an access token to the client.
  5. The client requests the protected resource by passing the access token to the resource server.
  6. The resource server validates the access token and, if it is valid, responds to the request.

In my next blog, we will see how Authorization Grant is used to obtain Authorization.