Ender Dincer
Jan 15, 2023
Apache Kafka Fundamentals
This article covers what Apache Kafka is and the common domain terminology. The promise: by the end of the article, you will be able to speak the “Kafka language”.
Apache Kafka is open-source software, and it can be described in many ways: a distributed commit log, an event streaming platform, a distributed messaging system.
Kafka implements the publish-subscribe pattern, so we will start by examining pub-sub and then look into Kafka's components.
Modern distributed applications consist of many small services that need to communicate with each other. The more services there are in the system, the more important the efficiency of communication becomes. Using a blocking (synchronous) communication method like classic REST APIs adds more dependencies with each new service. To avoid this, asynchronous messaging can be used. The most basic non-blocking messaging pattern is the message queue, as offered by AWS SQS or ActiveMQ. It consists of a sender, a message queue and a receiver. In this pattern, the sender sends messages to a specific receiver, and messages are usually deleted after consumption. Introducing a new dedicated queue for each new receiver does not scale well.
The publish-subscribe pattern is another messaging pattern. Unlike a message queue, publishers broadcast messages to a topic without knowing which subscriber(s) will consume them. Likewise, subscribers can subscribe to these topics without any knowledge of the publishers. Since publishers are decoupled from subscribers, the topology of the application is very flexible; adding or removing publishers and subscribers is simple. You can think of the message queue pattern as a private chat application like WhatsApp, where you know whom you are sending the message to, whereas the pub-sub pattern is more like Twitter, where your post (message) can be seen by any follower (subscriber) of your account (topic).
We have talked about the components of the pub-sub pattern (publishers, subscribers, messages and topics), but we haven't talked about what they can physically be. Publishers and subscribers can be servers running Spring Boot or Node.js, or even serverless functions like AWS Lambda or GCP Functions. Topics are what Apache Kafka brings to the picture.
A Kafka topic is used for message categorization and organisation. For fault tolerance, topics (more precisely, their partitions) are replicated across different servers called Kafka brokers. Multiple Kafka brokers, together with a coordination service (historically ZooKeeper, a KRaft controller quorum in newer versions), form a Kafka cluster.
Producers, Consumers and Kafka Brokers
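To make this concrete, here is a minimal sketch of creating a replicated topic with Kafka's Java AdminClient. The broker address, topic name, and partition/replica counts are illustrative assumptions (a replication factor of 2 requires at least two brokers), not values from this article:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local cluster

        try (Admin admin = Admin.create(props)) {
            // 3 partitions, replication factor 2: each partition is
            // copied to 2 brokers for fault tolerance.
            NewTopic topic = new NewTopic("orders", 3, (short) 2);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```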
Topics are not the smallest data structures that store messages; each topic consists of one or more logs called partitions.
In terms of data structures, a partition is a commit log. Commit logs are sequences of records. Each new record is appended to the end of the log, and records are read from first to last (FIFO). Finally, the values of records in a commit log are immutable.
Okay, is this not simply a queue then? No, there is a fundamental difference between the two. On each read from a queue, data is removed from it; a queue is transient. If multiple applications were to consume from a queue, they would race for the records. Reading from a log, however, does not remove the data, so multiple subscribers can read from the same log. We can visualise it like below.
Commit Log vs Queue
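To see the difference in code, here is a toy, in-memory sketch of an append-only log (not Kafka's actual implementation) in which reads never remove records, so multiple readers can consume the same data independently:

```java
import java.util.ArrayList;
import java.util.List;

// A toy append-only log: records are never removed on read.
// Each reader tracks its own position (offset), so many readers can
// consume the same data independently -- unlike a queue, where each
// record is handed to exactly one consumer and then discarded.
class ToyLog<T> {
    private final List<T> records = new ArrayList<>();

    void append(T record) {
        records.add(record); // new records only ever go to the end
    }

    T read(int offset) {
        return records.get(offset); // reading leaves the record in place
    }

    int endOffset() {
        return records.size();
    }
}

class Demo {
    public static void main(String[] args) {
        ToyLog<String> log = new ToyLog<>();
        log.append("event-1");
        log.append("event-2");

        // Two independent readers, each with its own offset.
        for (int offset = 0; offset < log.endOffset(); offset++) {
            System.out.println("reader-A: " + log.read(offset));
        }
        for (int offset = 0; offset < log.endOffset(); offset++) {
            System.out.println("reader-B: " + log.read(offset)); // same data, still there
        }
    }
}
```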
An event may trigger a heavy, time-consuming task. To decrease the processing time of each event, you can add more resources such as RAM and CPU to your consumer (vertical scaling), but that has its limits. Thanks to Kafka, you can also scale your application horizontally: Kafka allows multiple consumers to consume the same topic in parallel. Together, these consumers are called a consumer group, and a Kafka broker identifies each group by its consumer group id. Brokers assign partitions to consumers automatically if they are in a consumer group; if not, you have to assign partitions to consumers manually. You can run multiple instances in group 1, as shown below, to process the data stream in parallel, and with group 2 you can process the same data stream with fewer resources for business analytics or monitoring purposes.
Consumer Group
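As a minimal sketch, here is what a consumer in group 1 could look like with Kafka's Java client; the broker address, group id and topic name are assumptions for illustration:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("group.id", "group-1"); // consumers sharing this id form one group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // The broker assigns partitions among all consumers in the group.
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

Running a second instance of this program with the same group id splits the partitions between the two instances; running it with a different group id re-reads the whole stream independently.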
For some applications, the order of messages is critical. Kafka guarantees order within a partition, but not across a topic. Kafka simply uses an integer, called the offset, to identify records and maintain order within a partition. Offsets are unique per partition, and with each new event the offset is incremented by 1. If topic-level ordering is absolutely required, a topic with a single partition can be used; however, your application would then lose the scalability and concurrent-processing potential Kafka offers.
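Keys are the usual way to exploit the per-partition guarantee: records with the same key are hashed to the same partition, so they keep their relative order. A short sketch, with an assumed "orders" topic and illustrative keys:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key hash to the same partition,
            // so events for customer-42 keep their relative order.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order-created"));
            producer.send(new ProducerRecord<>("orders", "customer-42", "order-paid"));
            producer.send(new ProducerRecord<>("orders", "customer-42", "order-shipped"));
        }
    }
}
```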
We talked about how well Kafka decouples producers and consumers, and yet there is a dependency we can't ignore: schemas. Schemas are contracts, so they should be respected and protected by both sides. They are the skeletons, or blueprints, of the data: they define the structure of events. There are many schema formats you can use with Kafka; Avro, Protobuf and JSON Schema are common choices.
Due to new requirements, optimisations, refactoring or other reasons, schemas will eventually evolve. The new version of a schema should be made available to consumers so that they can deserialize the data. Usually this is done by an application called a schema registry, which is external to Apache Kafka.
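As one example, Confluent's Schema Registry is a popular implementation. The sketch below assumes its kafka-avro-serializer dependency, a registry running at an assumed local address, and an illustrative Avro schema; the details would differ for other registries:

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // assumption: local broker
        props.put("schema.registry.url", "http://localhost:8081"); // assumption: local registry
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's serializer registers the schema with the registry
        // and embeds the schema id in each message.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");

        // An illustrative Avro schema: an Order record with a single string field.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Order\",\"fields\":"
              + "[{\"name\":\"id\",\"type\":\"string\"}]}");
        GenericRecord order = new GenericData.Record(schema);
        order.put("id", "order-1");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "customer-42", order));
        }
    }
}
```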
There are hundreds of configurations for brokers, topics, consumers and producers. Of course we can't go through them all, but there is one I have to mention to complete the picture: the clean-up policy. There are two main types of clean-up policies.
The default policy (the one you get if you don't set a clean-up policy) is "delete". The delete policy depends on another configuration called log retention: if an event is older than the log retention time, it is deleted from the log.
The second policy is "compact". With this policy, log retention is ignored and messages are not deleted with time. Instead, they are deleted based on their keys: when multiple messages share the same key, only the latest one is retained in the log. The aim is to keep the latest value for every key. It's important to note that messages are not re-ordered; older ones are deleted and the remaining messages keep their offsets. Triggering compaction is complex, and when not configured properly it can put a lot of load on your Kafka cluster; the triggering conditions deserve their own article.
You can also use both policies at the same time by setting the clean-up policy to "compact,delete". Whichever condition is met first triggers the corresponding policy.
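As an illustrative sketch, both policies can be set when creating a topic with the Java AdminClient; the topic name and retention value below are assumptions:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local cluster

        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("customer-profiles", 3, (short) 2)
                    .configs(Map.of(
                            "cleanup.policy", "compact,delete", // both policies at once
                            "retention.ms", "604800000"));      // delete side: 7 days
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```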
The producer API allows events to be published with only a topic name and a key; in other words, a key with no value (a null value). These events are called tombstones. Tombstones are used as "logical deletes" in Kafka: if consumers know how to interpret them, they can treat them as delete operations. If the clean-up policy is compact, the events with that key will eventually be physically deleted from the log as well.
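A minimal sketch of publishing a tombstone with the Java client, with the same assumed local broker and an illustrative topic and key:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TombstoneProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A tombstone: a key with a null value. On a compacted topic,
            // compaction will eventually remove all earlier records with this key.
            producer.send(new ProducerRecord<>("customer-profiles", "customer-42", null));
        }
    }
}
```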
We have seen the components of a Kafka cluster and how they work with each other. If you have joined a new team that uses Kafka, or if you have started learning Kafka on your own, this article should give you enough for your first few weeks. To expand your Kafka knowledge, I highly recommend the book "Kafka: The Definitive Guide", co-written by one of Kafka's co-creators. Finally, Kafka is open source, but managing and maintaining a cluster can be hard. Luckily, there are platforms like Confluent, AWS, Azure and many others that manage Kafka and abstract most of the complexities away for you.