Kafkaesque Streaming

What is Kafka?

Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming applications. You can think of it as a distributed commit log: instead of sending data directly from point A to point B, Kafka acts as a durable, high-performance, fault-tolerant log where data is written once and consumed many times.

Glossary

  • Producer: Sends (produces) messages to Kafka topics.
  • Consumer: Reads (consumes) messages from Kafka topics.
  • Topic: A category/feed name to which messages are sent and from which consumers receive messages.
  • Partition: Topics are split into partitions to allow scalability.
  • Broker: A Kafka server that stores data and serves clients.
  • ZooKeeper: Manages Kafka’s metadata and cluster state (Kafka 3.x can run without it in KRaft mode, but many setups still use it).
  • Consumer Group: A group of consumers that coordinate to read data in parallel.

This article walks through Kafka from the ground up using the Kafka CLI via Docker, progressing into its more advanced capabilities and architectural nuances.

Kafka Using Docker

Start Kafka with Docker

A basic docker-compose.yml for Kafka + ZooKeeper is shown below:

version: '3'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

  kafka:
    image: confluentinc/cp-kafka:latest
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  • Start it:
docker-compose up -d
  • Verify that the confluentinc/cp-kafka:latest container is running and note its container ID (used in the commands that follow):
docker ps 

CLI Commands

List Topics

docker exec -it <container_id> kafka-topics --bootstrap-server localhost:9092 --list

Create a Topic

docker exec -it <container_id> kafka-topics \
  --bootstrap-server localhost:9092 \
  --create --topic my-topic \
  --partitions 1 --replication-factor 1

Describe a Topic

docker exec -it <container_id> kafka-topics \
  --bootstrap-server localhost:9092 \
  --describe --topic my-topic

Produce (Send) Messages

docker exec -it <container_id> kafka-console-producer \
  --bootstrap-server localhost:9092 --topic my-topic

Type your messages, press Enter to send each one.
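
Messages can also carry a key, which determines the partition a record lands on. A minimal sketch using the console producer's key-parsing properties (the key:value input format is just illustrative):

docker exec -it <container_id> kafka-console-producer \
  --bootstrap-server localhost:9092 --topic my-topic \
  --property parse.key=true \
  --property key.separator=:

Each input line is then split on the first : into a key and a value (e.g. user42:clicked). Records with the same key always land on the same partition, which preserves per-key ordering.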

Consume Messages

docker exec -it <container_id> kafka-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic my-topic --from-beginning
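
To see the keys alongside the values, the console consumer can be asked to print them:

docker exec -it <container_id> kafka-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic my-topic --from-beginning \
  --property print.key=true \
  --property key.separator=" => "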

Delete a Topic

docker exec -it <container_id> kafka-topics \
  --bootstrap-server localhost:9092 \
  --delete --topic my-topic

Start Consumer Group

docker exec -it <container_id> kafka-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic my-topic --group my-group

List Consumer Groups

docker exec -it <container_id> kafka-consumer-groups \
  --bootstrap-server localhost:9092 --list

Describe Consumer Group

docker exec -it <container_id> kafka-consumer-groups \
  --bootstrap-server localhost:9092 \
  --describe --group my-group

Cleanup

docker-compose down
docker volume prune  # removes all unused Docker volumes, including any Kafka data volumes

Kafka Architecture

Kafka is designed around a pull-based model, where consumers pull data from brokers at their own pace. Messages are stored durably on disk and replicated across brokers for fault tolerance.

Key architectural traits:

  • Immutable commit log
  • Offset tracking per consumer group, the basis for at-least-once and at-most-once delivery; committed offsets can be rewound to replay data (see the example after this list)
  • Replication for durability and high availability
  • Horizontal scalability by adding more brokers/partitions
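
Because offsets are tracked per consumer group, a group's position in the log can be inspected and rewound to replay data. A sketch using the CLI from earlier (offsets can only be reset while the group has no active consumers):

# Preview a reset of my-group to the beginning of my-topic
docker exec -it <container_id> kafka-consumer-groups \
  --bootstrap-server localhost:9092 \
  --group my-group --topic my-topic \
  --reset-offsets --to-earliest --dry-run

# Apply the reset
docker exec -it <container_id> kafka-consumer-groups \
  --bootstrap-server localhost:9092 \
  --group my-group --topic my-topic \
  --reset-offsets --to-earliest --execute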

Advanced Topics

Kafka Internals

Kafka uses log-structured storage: records are appended sequentially to partition log segments on disk. This enables:

  • Fast disk I/O (sequential writes)
  • Efficient compaction and retention policies (see the example after this list)
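
For example, a changelog-style topic can be created with compaction enabled, and retention can be tuned per topic (the user-profiles topic name and the 7-day value are just illustrative):

# Compacted topic: Kafka keeps at least the latest record per key
docker exec -it <container_id> kafka-topics \
  --bootstrap-server localhost:9092 \
  --create --topic user-profiles \
  --partitions 1 --replication-factor 1 \
  --config cleanup.policy=compact

# Change time-based retention on an existing topic to 7 days
docker exec -it <container_id> kafka-configs \
  --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name my-topic \
  --add-config retention.ms=604800000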

Delivery Semantics

Kafka supports:

  • At-most-once (no retry)
  • At-least-once (with retries)
  • Exactly-once (via idempotent producers and transactions; see the sketch after this list)
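
Some of these knobs are visible even from the console tools. A rough sketch (true exactly-once also needs transactions, which the console producer does not drive; these properties only cover producer idempotence and reading committed data):

# Producer that deduplicates retries and waits for all in-sync replicas
docker exec -it <container_id> kafka-console-producer \
  --bootstrap-server localhost:9092 --topic my-topic \
  --producer-property enable.idempotence=true \
  --producer-property acks=all

# Consumer that skips records from aborted transactions
docker exec -it <container_id> kafka-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic my-topic --from-beginning \
  --consumer-property isolation.level=read_committed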

Kafka Connect

A framework for scalable and fault-tolerant integration with external systems like databases, cloud storage, etc.
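
Connectors are managed through a REST API on the Connect worker. A sketch, assuming a Connect worker on localhost:8083 (not part of the docker-compose above) with the bundled FileStream connector on its plugin path; the connector name and file path are illustrative:

curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
        "name": "file-source-demo",
        "config": {
          "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
          "tasks.max": "1",
          "file": "/tmp/input.txt",
          "topic": "my-topic"
        }
      }'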

Kafka Streams

Kafka’s native stream processing library for building real-time applications. It supports windowed joins, aggregations, and stateful transformations.
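
Streams applications are written in Java or Scala against the Streams API, so there is no dedicated CLI. The Apache Kafka distribution does ship a WordCount demo that can be run with kafka-run-class, assuming the kafka-streams-examples jar is on the image's classpath (it may not be present in cp-kafka):

# Create the demo's input topic, then run the bundled word-count topology
docker exec -it <container_id> kafka-topics \
  --bootstrap-server localhost:9092 \
  --create --topic streams-plaintext-input \
  --partitions 1 --replication-factor 1

docker exec -it <container_id> kafka-run-class \
  org.apache.kafka.streams.examples.wordcount.WordCountDemo

The demo counts words arriving on streams-plaintext-input and writes running counts to streams-wordcount-output.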

ksqlDB (formerly KSQL)

A SQL-like interface for processing and querying Kafka topics in real time. Great for monitoring, alerting, and ad hoc analytics.

Schema Registry

Stores message schemas and enforces compatibility rules as schemas evolve (usually with Avro, Protobuf, or JSON Schema).
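
Schemas are registered and retrieved over a REST API. A sketch, assuming a Schema Registry instance on localhost:8081 (not part of the docker-compose above); the Avro record is illustrative:

# Register a minimal Avro schema for the value side of my-topic
curl -X POST http://localhost:8081/subjects/my-topic-value/versions \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"schema": "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}"}'

# List registered subjects
curl http://localhost:8081/subjects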

Security & Reliability

Kafka supports:

  • TLS encryption (for data in transit)
  • SASL authentication
  • ACLs for fine-grained authorization

Reliability features:

  • Leader election for partitions
  • ISR (In-Sync Replicas) for high availability
  • Durable storage with configurable retention

Design Patterns

  • Event Sourcing: Storing state changes as a sequence of events.
  • CQRS: Using Kafka to separate reads and writes with materialized views.
  • Log Compaction: Retaining only the last message per key, ideal for changelog streams.

Best Practices

  • Use idempotent producers (and transactions where needed) for exactly-once delivery.
  • Design partitioning strategy carefully (by key, round-robin, custom).
  • Monitor with tools like Prometheus, Grafana, Confluent Control Center.
  • Use Kafka Connect with dead letter queues for fault-tolerant ingestion.
  • Leverage Kafka Streams or Flink for advanced real-time processing.

When to Use Kafka

Kafka excels when:

  • You need to handle high-throughput, real-time data
  • You're building event-driven or microservices architectures
  • You want durable, replayable, fault-tolerant logs

It’s not ideal for:

  • Short-lived messages (use Redis or RabbitMQ)
  • Direct request-response communication
  • Heavy transformations (use Kafka Streams or an external processor)