Project
MAR 2026

Pavan Dhadge

WikiMonitor - Wikimedia Edit Streaming with Kafka

Live Edit Events → Kafka → Reliable Monitoring – No Batch Shortcuts

WikiMonitor is a stream-oriented project that ingests the live firehose of Wikimedia edit events (Wikipedia, Wikimedia Commons, etc.) and routes them through Kafka for real-time processing and health monitoring.

Instead of one-off API calls or batch jobs, everything is event-driven: continuous ingestion, reliable transport, idempotent consumption, and visibility into lag/throughput under natural burst patterns (e.g., viral article edits).

It’s a hands-on way to learn Kafka in production-like conditions without building a toy producer-consumer demo.


View WikiMonitor on GitHub →

Why This Project?

Wikimedia’s public edit stream is one of the best free, real-world event sources: high volume (~10–100 edits/sec average, bursts to 1000+), diverse event types (edit, new page, log, categorize), and no auth needed.

I chose it to practice streaming fundamentals outside synthetic data:

  • Ingestion under continuous load
  • Partitioning and consumer scaling
  • Lag monitoring and backpressure handling
  • Idempotency and exactly-once semantics
  • Replay from offsets after crashes/restarts
  • Burst resilience (e.g., during major news events)

It forced me to think in terms of offsets, consumer groups, delivery guarantees, and operational visibility — not just "send message, receive message".

System Architecture

High-level flow:

  1. Producer — connects to Wikimedia SSE stream (https://stream.wikimedia.org/v2/stream/recentchange)
  2. Normalization — parses JSON events, adds metadata (timestamp, topic routing key)
  3. Kafka Publish — sends to one or more topics (e.g., wiki.raw.edits, wiki.filtered) with proper partitioning
  4. Consumers — multiple groups/processes read, process, aggregate (counts, anomalies), and monitor health
  5. Monitoring — track lag, throughput, error rates, partition balance
  6. Persistence — optional summaries or sampled events for later replay/analysis
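Steps 2–3 above can be sketched as a small pure function. The field names (`meta.id`, `wiki`, `type`, `title`) come from Wikimedia's recentchange event schema; the topic name `wiki.raw.edits` and the choice of routing key are this project's conventions, shown here as an illustrative sketch rather than the repo's exact code.

```python
import json
import time

RAW_TOPIC = "wiki.raw.edits"  # topic name from the architecture above

def normalize(raw_line: str):
    """Parse one SSE `data:` payload into a (topic, key, value) triple.

    The key is the wiki/project identifier, so all edits for one wiki
    land on the same partition (preserving per-wiki ordering).
    """
    event = json.loads(raw_line)
    key = event.get("wiki", "unknown")          # e.g. "enwiki" -- routing key
    value = {
        "id": event.get("meta", {}).get("id"),  # stable event ID, used for dedup
        "type": event.get("type"),              # edit / new / log / categorize
        "title": event.get("title"),
        "ingested_at": time.time(),             # added metadata
    }
    return RAW_TOPIC, key, value

# Example with a trimmed recentchange-style event:
topic, key, value = normalize(
    '{"meta": {"id": "abc-123"}, "wiki": "enwiki", "type": "edit", "title": "Kafka"}'
)
```

Keeping normalization a pure function makes it trivial to unit-test and to reuse across producer restarts and offline replays.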

Focus: stream reliability over fancy analytics — ensure nothing is lost, duplicated, or silently stalled.
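The "nothing is lost or duplicated" goal is mostly producer configuration. A minimal sketch, using confluent-kafka / librdkafka property names (`acks`, `retries`, `retry.backoff.ms`, `linger.ms`, `enable.idempotence`); the broker address and specific values are illustrative assumptions, not the repo's tuned constants:

```python
# Producer settings aimed at durability over latency.
# Property names follow librdkafka (confluent-kafka); values are illustrative.
producer_config = {
    "bootstrap.servers": "localhost:9092",  # assumed local broker
    "acks": "all",               # wait for all in-sync replicas before success
    "retries": 5,                # retry transient broker errors
    "retry.backoff.ms": 200,     # pause between retries instead of hammering
    "linger.ms": 50,             # small batching window to absorb bursts
    "enable.idempotence": True,  # broker dedups producer retries
}

# The producer would then be created as:
#   from confluent_kafka import Producer
#   producer = Producer(producer_config)
```

`acks=all` plus idempotence trades a little latency for the guarantee that a retried send cannot silently duplicate or drop an event.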

Key Features

  • Live ingestion from Wikimedia public SSE endpoint
  • Kafka producer with retries, backoff, and acks=all
  • Consumer groups with offset management and graceful shutdown
  • Basic metrics: edit rate, lag per partition, consumer health
  • Replay support: reset offsets to earliest/latest for testing
  • Failure simulation: kill producer/consumer, network delay injection
  • Idempotent processing (dedup by event ID where possible)
  • Logging and health endpoints for observability
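The idempotent-processing feature above hinges on remembering recently seen event IDs. A minimal sketch of one way to do that with bounded memory — the class name and capacity are hypothetical, not taken from the repo:

```python
from collections import OrderedDict

class Deduplicator:
    """Remember recently seen event IDs so replays and retries are no-ops.

    Memory is bounded: the oldest IDs are evicted once `capacity` is hit,
    which is acceptable because duplicates from retries and replays tend
    to arrive close together in time.
    """
    def __init__(self, capacity: int = 100_000):
        self.capacity = capacity
        self._seen = OrderedDict()

    def is_new(self, event_id: str) -> bool:
        if event_id in self._seen:
            self._seen.move_to_end(event_id)  # refresh recency
            return False
        self._seen[event_id] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)    # evict oldest ID
        return True

# Simulated consumer loop over replayed events ("e1" arrives twice):
dedup = Deduplicator()
events = [{"id": "e1"}, {"id": "e2"}, {"id": "e1"}]
processed = [e["id"] for e in events if dedup.is_new(e["id"])]
```

Running this processes `e1` and `e2` once each; the replayed `e1` is skipped, which is exactly what makes offset resets safe to use for testing.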

Technical Insights & Lessons

  • Partition strategy matters — keying by wiki/project or page title preserves per-key ordering while spreading load across partitions
  • Lag visibility is everything — without it, silent failures build up
  • Idempotency saves you during replays or retries
  • Producer tuning (acks, retries, linger.ms) prevents burst amplification
  • Consumer rebalancing can cause temporary lag spikes — plan for it
  • Schema evolution — even simple JSON needs forward compatibility
  • Burst handling — real Wikimedia spikes teach backpressure better than any benchmark
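The partition-strategy lesson comes down to one deterministic mapping. Kafka's default partitioner actually applies murmur2 to the key bytes; the md5 stand-in below is only to keep the sketch self-contained — the property that matters is that the same key always lands on the same partition:

```python
import hashlib

NUM_PARTITIONS = 6  # illustrative partition count

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a routing key to a partition deterministically.

    Kafka's default partitioner uses murmur2 on the key bytes; md5 is a
    stand-in here -- what matters is that the mapping is stable.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Same key always maps to the same partition, so per-wiki ordering holds,
# while different keys hash independently across the partition range:
assert partition_for("enwiki") == partition_for("enwiki")
partitions = {w: partition_for(w) for w in ["enwiki", "dewiki", "frwiki"]}
```

Because consumers in a group each own a subset of partitions, this mapping is also what decides how evenly work spreads when you scale consumers out.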

This project gave me confidence that I can design, deploy, and operate event pipelines without cloud abstractions.

Future Scope / TODOs

  • Real-time anomaly detection (spike alerts, vandalism patterns)
  • Multi-topic routing and filtering
  • Basic dashboard (Prometheus + Grafana or simple web UI)
  • Exactly-once semantics with Kafka transactions
  • Geo/visualization of edit activity
  • Alerting on high lag or consumer drops

Repository

If you’re learning Kafka or event streaming, this is a great starter: real data, no mock events, and real failure modes.

“Streaming systems reward clarity: when flow control, lag, and failure paths are explicit, operations become predictable — even when the firehose opens.”