Kafka Internals as Append-Only Logs

Explore Apache Kafka's log-based architecture: topics, partitions, segments, and offsets. Learn how append-only logs enable high-throughput messaging.

Apache Kafka is often called a distributed commit log, but what does that mean in practice? This guide unpacks Kafka’s internal log-based architecture (topics, partitions, segments, and offsets) and shows how these components enable high-throughput, scalable message processing.


🧠 Topic

  • A topic is a logical container grouping messages, like a folder.
  • It is made up of one or more partitions, each of which is an append-only log.
  • Topics are split into partitions to enable parallelism and scalability.

📄 Partition

  • Each partition is a single append-only log.
  • It preserves message order within itself.
  • Appending is a fast, sequential write to disk.
  • Multiple partitions allow multiple producers and consumers to work concurrently.
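
To make the append-only idea concrete, here is a minimal in-memory sketch of a single partition. The class name `PartitionLog` and its methods are hypothetical and exist only for illustration; a real broker writes to files on disk rather than to a Python list.

```python
class PartitionLog:
    """Toy in-memory model of one partition: an append-only list of records."""

    def __init__(self):
        self._records = []  # the index in this list doubles as the offset

    def append(self, value):
        """Append a record to the end and return the offset it was assigned."""
        offset = len(self._records)
        self._records.append(value)
        return offset

    def read(self, start_offset, max_records=10):
        """Return (offset, value) pairs starting at start_offset."""
        end = min(start_offset + max_records, len(self._records))
        return [(i, self._records[i]) for i in range(start_offset, end)]


log = PartitionLog()
first = log.append("click:home")   # assigned offset 0
second = log.append("click:cart")  # assigned offset 1
batch = log.read(0)                # both records, in append order
```

Nothing is ever inserted in the middle or overwritten, which is why order within the partition is preserved and why writes stay sequential.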

📦 Segment

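  • Kafka divides each partition into segments capped at a configurable size (log.segment.bytes, 1 GiB by default).
  • Each segment is a .log file on disk, stored alongside its offset and timestamp index files (.index and .timeindex).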
  • New segments are created once size or time thresholds are met.
  • Old segments are deleted or compacted according to configured retention policies.
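
The rolling behavior can be sketched as follows. This is a simplified model that only considers the size threshold (not time), and `SegmentedLog` is a hypothetical name; the one detail taken from Kafka itself is that a segment file is named after its base offset, zero-padded to 20 digits.

```python
def segment_file_name(base_offset):
    """Name a segment after its base offset, zero-padded to 20 digits."""
    return f"{base_offset:020d}.log"


class SegmentedLog:
    """Toy partition that rolls to a new segment once the active one is full."""

    def __init__(self, max_segment_bytes):
        self.max_segment_bytes = max_segment_bytes
        self.next_offset = 0
        # Each segment is [base_offset, bytes_used]; the last one is active.
        self.segments = [[0, 0]]

    def append(self, payload: bytes):
        active = self.segments[-1]
        # Roll before writing if this record would overflow a non-empty segment.
        if active[1] > 0 and active[1] + len(payload) > self.max_segment_bytes:
            active = [self.next_offset, 0]
            self.segments.append(active)
        active[1] += len(payload)
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def file_names(self):
        return [segment_file_name(base) for base, _ in self.segments]


log = SegmentedLog(max_segment_bytes=10)
for _ in range(5):
    log.append(b"abcd")  # 4 bytes each, so two records fit per segment
names = log.file_names()
```

With these toy numbers the log rolls at offsets 2 and 4, so the three segment files are named after base offsets 0, 2, and 4.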

🔢 Offset

  • An offset uniquely identifies a message’s position within a partition.
  • It starts at 0 and increases by one with each record appended to the partition; offsets are never reused.
  • Consumers use offsets to keep track of their reading progress.

✍️ Producer

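  • Producers send data to topics and choose a partition for each record based on:
    • The record key: the default partitioner hashes the key and takes the result modulo the partition count, so equal keys always land in the same partition, or
    • For records without a key, a spreading strategy: round-robin in older clients, sticky partitioning by default since Kafka 2.4 (a batch is filled for one partition before the producer moves to the next)
  • Each record is appended to the end of the chosen partition’s log; existing records are never modified in place.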
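
Key-based partition selection can be sketched in a few lines. This is an illustration under assumptions: `choose_partition` is a hypothetical helper, and `zlib.crc32` stands in for the murmur2 hash that Kafka's Java client uses, so the partition numbers it yields will not match a real cluster.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition: stable hash modulo partition count."""
    # zlib.crc32 is a stand-in; Kafka's Java client hashes keys with murmur2.
    return zlib.crc32(key) % num_partitions


p1 = choose_partition(b"user-42", 3)
p2 = choose_partition(b"user-42", 3)  # same key, same partition
```

Because the result depends on the partition count, adding partitions to a topic later changes where existing keys map, which is worth keeping in mind when per-key ordering matters.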

📥 Consumer

  • Consumers read messages from specific topic partitions by offset.
  • Kafka guarantees ordering only within partitions.
  • Each consumer tracks offsets on a per-partition basis (manually or automatically).

🧱 Broker

  • A broker is a Kafka server that stores partition data on disk.
  • It manages partitions, segments, and associated index files.
  • Brokers handle replication, client requests, and leadership elections.

🔁 Replication

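  • Each partition has one leader replica and zero or more follower replicas, as set by the topic’s replication factor.
  • Followers continuously fetch new records from the leader and append them to their own copy of the log.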
  • This replication ensures fault tolerance and data durability.
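
The follower side of replication can be pictured as a loop that repeatedly asks the leader for whatever comes after the follower's own log end. The sketch below is a deliberately simplified model: `Replica` and `follower_fetch` are hypothetical names, and it leaves out networking, acknowledgements, and failure handling.

```python
class Replica:
    """Toy replica: just a list of records."""

    def __init__(self):
        self.log = []

    @property
    def log_end_offset(self):
        return len(self.log)


def follower_fetch(leader, follower, max_records=2):
    """Copy up to max_records from the leader, starting at the follower's log end."""
    start = follower.log_end_offset
    follower.log.extend(leader.log[start:start + max_records])


leader, follower = Replica(), Replica()
leader.log.extend(["a", "b", "c"])

follower_fetch(leader, follower)  # follower now holds "a", "b"
lag_after_one = leader.log_end_offset - follower.log_end_offset
follower_fetch(leader, follower)  # follower catches up with "c"
```

Since the follower only ever appends what it fetched, its log is a prefix of the leader's log at every step.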

🧹 Log Compaction

  • Log compaction retains only the most recent record for each key.
  • Useful for topics that represent state changes.
  • Compaction runs in the background on older, closed segments; surviving records keep their original offsets, and new writes still only append to the end of the log.
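
The effect of compaction on a keyed log can be sketched like this. It is a rough model only: the `compact` function is hypothetical, works on an in-memory list, and ignores details of the real cleaner such as delete markers and the fact that the active segment is left alone.

```python
def compact(records):
    """Keep the latest record per key; surviving records keep their offsets.

    records: list of (offset, key, value) tuples in offset order.
    """
    latest_offset = {}
    for offset, key, _ in records:
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [r for r in records if latest_offset[r[1]] == r[0]]


log = [
    (0, "user-1", "cart=1"),
    (1, "user-2", "cart=5"),
    (2, "user-1", "cart=2"),
    (3, "user-1", "cart=3"),
]
compacted = compact(log)
```

Here `compacted` keeps offset 1 for user-2 and offset 3 for user-1. The gaps at offsets 0 and 2 are expected: offsets are never renumbered.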

🛍️ Real-World Analogy: Kafka for User Activity Logs

Example: E-commerce Clickstream

  • Topic: user-clicks
  • Partitions:
    • Partition 0 → users where hash(key) % 3 == 0
    • Partition 1 → users where hash(key) % 3 == 1
    • Partition 2 → users where hash(key) % 3 == 2

Example Segment Files

/kafka-logs/user-clicks-0/00000000000000000000.log
/kafka-logs/user-clicks-1/00000000000000000000.log
/kafka-logs/user-clicks-1/00000000000001000000.log
/kafka-logs/user-clicks-2/...
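
Because each segment file is named after the first offset it contains, finding the file that holds a given offset is a search over base offsets. The helpers below (`base_offset`, `segment_for`) are hypothetical and only illustrate the idea; a real broker also consults the index files to find the position inside the segment.

```python
import bisect


def base_offset(file_name):
    """Parse the base offset out of a segment file name."""
    return int(file_name.split(".")[0])


def segment_for(offset, file_names):
    """Return the segment file whose offset range contains the given offset."""
    bases = sorted(base_offset(name) for name in file_names)
    # Pick the rightmost base offset that is <= the requested offset.
    i = bisect.bisect_right(bases, offset) - 1
    if i < 0:
        raise ValueError(f"offset {offset} is below the earliest segment")
    return f"{bases[i]:020d}.log"


# The two segment files of user-clicks-1 from the listing above.
files = ["00000000000000000000.log", "00000000000001000000.log"]
early = segment_for(1234, files)       # falls in the first segment
later = segment_for(1000000, files)    # first offset of the second segment
```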

Kafka Log Structure Mapping

| Kafka Term | Log Analogy | Example |
| --- | --- | --- |
| Topic | Directory of logs | user-clicks |
| Partition | Log file | user-clicks-1 |
| Segment | Rotated chunk file | 00000000000001000000.log |
| Offset | Line number | 1234 |
| Producer | Log appender | Appends to user-clicks-1 |
| Consumer | Log reader | Reads from user-clicks-1 @ 1234 |

🎯 Consumer & Partition Selection

Offset is Partition-Specific

  • Offset 500 in Partition 0 is completely independent from Offset 500 in Partition 1.
  • Consumers must keep track of offsets for each partition separately.
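
A small sketch of per-partition tracking, assuming a plain dictionary keyed by (topic, partition). The `advance` helper is hypothetical; client libraries keep this bookkeeping for you.

```python
# Read position of one consumer, tracked separately for each partition.
positions = {
    ("user-clicks", 0): 500,
    ("user-clicks", 1): 500,
}


def advance(positions, topic_partition, records_read):
    """Move the position for one partition only; others are untouched."""
    positions[topic_partition] += records_read


advance(positions, ("user-clicks", 0), 25)
```

After this call, partition 0 is at 525 while partition 1 is still at 500, even though both started from the same number.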

How Consumers Get Partitions

  1. Consumer Groups

    • Kafka automatically assigns partitions to consumers within a group.
    • Offsets are committed per consumer group and per partition, so different groups can read the same topic independently.
  2. Manual Assignment

    • Applications can manually assign specific partitions.
    • Offsets can be explicitly specified (e.g., from start, end, or a checkpoint).
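
Both approaches can be sketched in plain Python. The round-robin dealing below is a simplified illustration, not Kafka's actual assignor implementation, and `assign_round_robin` and `starting_offset` are hypothetical helper names.

```python
def assign_round_robin(partitions, consumers):
    """Deal partitions out to group members in turn (simplified illustration)."""
    members = sorted(consumers)
    assignment = {member: [] for member in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


def starting_offset(mode, log_start, log_end, checkpoint=None):
    """Resolve where a manually assigned consumer begins reading."""
    if mode == "start":
        return log_start
    if mode == "end":
        return log_end
    if mode == "checkpoint" and checkpoint is not None:
        return checkpoint
    raise ValueError(f"unsupported mode: {mode}")


# Group-managed: three partitions shared by two consumers.
group = assign_round_robin([0, 1, 2], ["consumer-a", "consumer-b"])

# Manual: resume partition 1 from a saved checkpoint.
resume_at = starting_offset("checkpoint", log_start=0, log_end=2000, checkpoint=1234)
```

In the group case each partition ends up with exactly one owner inside the group, which is what keeps per-partition ordering intact for consumers.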

Kafka’s append-only log design is the foundation of its performance and reliability. Understanding these internals helps you design more efficient systems, particularly event-driven architectures, real-time analytics, and distributed data pipelines.

This post is licensed under CC BY 4.0 by the author.