Kafka: the Big Data streaming platform

In modern information systems, we are confronted with ever-increasing volumes of data requiring to be processed in real time. However, the point-to-point connections commonly used do not allow easy loading scalability. Data producing services have a strong link with the consumer services. It is from this observation that Kafka was brought up: a distributing message bus. Initially developed by LinkedIn teams, the project is now part of the Apache Foundation.

In this article, we will discover how Kafka functions and the answers it provides to the Big Data processing needs.

Kafka Components

Kafka properly speaking is composed of Brokers and Zookeeper. The brokers serve as pivot between the different services. They save data that passes through, while providing data redundancy to have a strong fault tolerance. Zookeeper, meanwhile, manages the distributed configuration, as well as coordinating brokers and monitoring the status of gravitational services around Kafka.

Among these services there are two main categories of application interacting with Kafka:

Producers produce a message stream to Kafka. They publish messages on a topic.
On the other hand Consumers read a message stream from Kafka. They subscribe to a topic.

Kafka provides complementary APIs to implement derived services:

Stream processors are both Consumer and Producer. Their role is to convert the data from one to several topics in input to one to several topics in output. The main feature provided by the Kafka Stream API is to join with a static data table to obtain a new rich stream, or join directly two streams.
Connectors, in turn, are used to bridge data sources. For example, by observing changes in a database and translating these changes into a message stream.

Topics

So far we have seen that the exchanges were macroscopically based on a notion of topic. More concretely, a topic corresponds to a logical unit containing messages of the same category. A message consists of an optional key and a byte array content. The use of this table allows freedom on the choice of exchange format. The most used being ‘Json’ and especially ‘Avro’.

At the lowest level, Kafka relies on a data structure called ‘Log’ to store topics hard drive. A log is an array of immutable messages ordered according to their publication and each having a unique offset. When a producer posts a message, each new message is added at the end of the table. Messages are spread across partitions based on a key.

In order to limit contention during write and read actions, a topic is broken down into partitions ensuring maximum parallelism. The maximum number of consumers acting simultaneously is directly correlated to the maximum number of configured partitions.

Moreover to ensure the high availability to the platform, these same partitions are copied to the different brokers based on the number of configured replication. If a broker falls, the platform can continue to operate without significant disruption.

This topic operation allows obtaining a robust system whose performance remains constant even when the data volume increases, even though the data is saved on disk. Kafka also gains a great memory advantage over systems. Persistence allows it to process data both in real time and in batch mode.

Consumer groups

At the consumer level, Kafka is based on a two-tier system. First, each consumer group can read each score. Then, the partitions are distributed within the same group. It is thus possible to read the same data several times for different uses by creating several consumer groups. If Consumer processing is time-consuming, just deploy an additional instance in the group to increase performance.

If a consumer stops responding, the currently playing partition is reassigned to another instance to the group to ensure that the messages will not be lost. Kafka assures in his contract that any message will be processed. However, it does not guarantee that a message will not be read repeatedly. During the reallocation of a partition, the new consumer resumes at the first offset.

Kafka at the heart of Big Data architectures

As we have just seen, Kafka is a high-performance, easily scalable system with a strong fault tolerance, which can both withstand the constraints in real-time and keep the data for deferred batchs processing.

It is not surprising to find Kafka at the center of implementations of architectures for Big Data, such as Lambda, Kappa or Smack.