Kafkaesque Streaming
What is Kafka?
Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming applications. You can think of it as a distributed commit log: instead of shuttling data from point A to point B, Kafka acts as a durable, high-performance, fault-tolerant log where data is written once and consumed many times by any number of independent readers.
Glossary
- Producer: Sends (produces) messages to Kafka topics.
- Consumer: Reads (consumes) messages from Kafka topics.
- Topic: A category/feed name to which messages are sent and from which consumers receive messages.
- Partition: Topics are split into partitions to allow scalability.
- Broker: A Kafka server that stores data and serves clients.
- ZooKeeper: Manages Kafka’s metadata and cluster state (Kafka 3.x can run without it in KRaft mode, but many setups still use it).
- Consumer Group: A group of consumers that coordinate to read data in parallel.
This article walks through Kafka from the ground up using the Kafka CLI via Docker, then progresses into its more advanced capabilities and architectural nuances.
Kafka Using Docker
Start Kafka with Docker
A basic docker-compose.yml for Kafka + ZooKeeper is shown below:
version: '3'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  kafka:
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
- Start it:
docker-compose up -d
- Verify that the confluentinc/cp-kafka:latest container is running and copy its container ID (it is used in all the commands that follow):
docker ps
CLI Commands
List topics
docker exec -it <container_id> kafka-topics --bootstrap-server localhost:9092 --list
Create a Topic
docker exec -it <container_id> kafka-topics \
--bootstrap-server localhost:9092 \
--create --topic my-topic \
--partitions 1 --replication-factor 1
Describe a Topic
docker exec -it <container_id> kafka-topics \
--bootstrap-server localhost:9092 \
--describe --topic my-topic
Produce (Send) Messages
docker exec -it <container_id> kafka-console-producer \
--bootstrap-server localhost:9092 --topic my-topic
Type your messages, press Enter to send each one.
Consume Messages
docker exec -it <container_id> kafka-console-consumer \
--bootstrap-server localhost:9092 \
--topic my-topic --from-beginning
Delete a Topic
docker exec -it <container_id> kafka-topics \
--bootstrap-server localhost:9092 \
--delete --topic my-topic
Start Consumer Group
docker exec -it <container_id> kafka-console-consumer \
--bootstrap-server localhost:9092 \
--topic my-topic --group my-group
List Consumer Groups
docker exec -it <container_id> kafka-consumer-groups \
--bootstrap-server localhost:9092 --list
Describe Consumer Group
docker exec -it <container_id> kafka-consumer-groups \
--bootstrap-server localhost:9092 \
--describe --group my-group
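A common follow-up is replaying a topic by rewinding the group’s committed offsets. A minimal sketch (the group must have no active members while you run it; drop --execute to preview the plan instead):
docker exec -it <container_id> kafka-consumer-groups \
--bootstrap-server localhost:9092 \
--group my-group --topic my-topic \
--reset-offsets --to-earliest --execute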
Cleanup
docker-compose down      # stop and remove the containers
docker-compose down -v   # same, but also removes volumes, wiping all Kafka data
Kafka Architecture
Kafka is designed around a pull-based model, where consumers pull data from brokers at their own pace. Messages are stored durably on disk and replicated across brokers for fault tolerance.
Key architectural traits:
- Immutable commit log
- Consumer offset tracking, enabling at-least-once (and, combined with transactions, exactly-once) semantics
- Replication for durability and high availability
- Horizontal scalability by adding more brokers/partitions
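To see partition-level metadata for yourself, create a multi-partition topic and describe it; the output lists each partition’s leader, replicas, and ISR. (scaled-topic is just an illustrative name; with the single-broker setup above, the replication factor is capped at 1.)
docker exec -it <container_id> kafka-topics \
--bootstrap-server localhost:9092 \
--create --topic scaled-topic --partitions 3 --replication-factor 1

docker exec -it <container_id> kafka-topics \
--bootstrap-server localhost:9092 \
--describe --topic scaled-topic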
Advanced Topics
Kafka Internals
Kafka uses log-structured storage: records are appended sequentially to segment files on disk. This enables:
- Fast disk I/O (sequential writes)
- Efficient compaction and retention policies
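Retention and compaction are per-topic settings. As a sketch, kafka-configs can override the retention of my-topic to roughly seven days, and --describe confirms the override:
docker exec -it <container_id> kafka-configs \
--bootstrap-server localhost:9092 \
--entity-type topics --entity-name my-topic \
--alter --add-config retention.ms=604800000

docker exec -it <container_id> kafka-configs \
--bootstrap-server localhost:9092 \
--entity-type topics --entity-name my-topic --describe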
Delivery Semantics
Kafka supports:
- At-most-once (no retry)
- At-least-once (with retries)
- Exactly-once (via idempotent producers and transactions)
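The console tools accept raw client properties, so you can experiment with these trade-offs directly. A minimal sketch of a producer with idempotence and full acknowledgements enabled (full exactly-once additionally requires transactions, which the console producer does not manage):
docker exec -it <container_id> kafka-console-producer \
--bootstrap-server localhost:9092 --topic my-topic \
--producer-property enable.idempotence=true \
--producer-property acks=all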
Kafka Connect
A framework for scalable and fault-tolerant integration with external systems like databases, cloud storage, etc.
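Connectors are managed through Connect’s REST API. A hedged sketch, assuming a Connect worker is running on its default port 8083 (the compose file above does not start one) with the FileStream connector plugin available:
curl -X POST -H "Content-Type: application/json" \
http://localhost:8083/connectors \
-d '{
  "name": "file-source",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "file": "/tmp/input.txt",
    "topic": "my-topic"
  }
}'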
Kafka Streams
Kafka’s native stream processing library for building real-time applications. It supports windowed joins, aggregations, and stateful transformations.
KSQL (ksqlDB)
SQL-like interface to process and query Kafka topics in real-time. Great for monitoring, alerting, and ad hoc analytics.
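Assuming a ksqlDB server and CLI are attached to this cluster (neither is in the compose file above), a sketch of a stream plus a windowed aggregation over my-topic; the stream name and columns are illustrative:
-- declare a stream over the existing topic
CREATE STREAM orders (id VARCHAR, amount DOUBLE)
WITH (KAFKA_TOPIC='my-topic', VALUE_FORMAT='JSON');

-- continuously emit per-key totals over 1-minute tumbling windows
SELECT id, SUM(amount) AS total
FROM orders
WINDOW TUMBLING (SIZE 1 MINUTE)
GROUP BY id
EMIT CHANGES;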
Schema Registry
Used to validate messages and enforce compatibility rules as schemas evolve (usually with Avro, Protobuf, or JSON Schema).
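Schemas are registered per subject over the registry’s REST API. A sketch, assuming a Schema Registry on its default port 8081 (not part of the compose file above), registering a minimal Avro schema for my-topic values:
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
http://localhost:8081/subjects/my-topic-value/versions \
-d '{"schema": "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}"}'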
Security & Reliability
Kafka supports:
- TLS encryption (for data in transit)
- SASL authentication
- ACLs for fine-grained authorization
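ACLs are managed with the kafka-acls tool. A sketch granting a hypothetical principal read access to my-topic; note this only takes effect if the broker is configured with an authorizer, which the minimal compose file above is not:
docker exec -it <container_id> kafka-acls \
--bootstrap-server localhost:9092 \
--add --allow-principal User:alice \
--operation Read --topic my-topic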
Reliability features:
- Leader election for partitions
- ISR (In-Sync Replicas) for high availability
- Durable storage with configurable retention
Design Patterns
- Event Sourcing: Storing state changes as a sequence of events.
- CQRS: Using Kafka to separate reads and writes with materialized views.
- Log Compaction: Retaining only the last message per key, ideal for changelog streams.
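For example, a compacted changelog topic can be declared at creation time (user-profiles is an illustrative name); Kafka then retains at least the latest record per key:
docker exec -it <container_id> kafka-topics \
--bootstrap-server localhost:9092 \
--create --topic user-profiles \
--partitions 1 --replication-factor 1 \
--config cleanup.policy=compact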
Best Practices
- Use idempotent producers for exactly-once delivery.
- Design your partitioning strategy carefully (by key, round-robin, or custom); see the keyed-producer sketch after this list.
- Monitor with tools like Prometheus, Grafana, Confluent Control Center.
- Use Kafka Connect with dead letter queues for fault-tolerant ingestion.
- Leverage Kafka Streams or Flink for advanced real-time processing.
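As a sketch of key-based partitioning with the console tools, records that share a key always land on the same partition. The separator character is arbitrary; type lines such as user-42:logged_in into the producer:
docker exec -it <container_id> kafka-console-producer \
--bootstrap-server localhost:9092 --topic my-topic \
--property parse.key=true --property key.separator=:

docker exec -it <container_id> kafka-console-consumer \
--bootstrap-server localhost:9092 \
--topic my-topic --from-beginning --property print.key=true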
When to Use Kafka
Kafka excels when:
- You need to handle high-throughput, real-time data
- You're building event-driven or microservices architectures
- You want durable, replayable, fault-tolerant logs
It’s not ideal for:
- Short-lived messages (use Redis or RabbitMQ)
- Direct request-response communication
- Heavy transformations inside the broker (Kafka itself only stores and moves data; use Kafka Streams or an external processor)