WikiMonitor - Wikimedia Edit Streaming with Kafka
Live Edit Events → Kafka → Reliable Monitoring – No Batch Shortcuts
WikiMonitor is a stream-oriented project that ingests the live firehose of Wikimedia edit events (Wikipedia, Wikimedia Commons, etc.) and routes them through Kafka for real-time processing and health monitoring.
Instead of one-off API calls or batch jobs, everything is event-driven: continuous ingestion, reliable transport, idempotent consumption, and visibility into lag/throughput under natural burst patterns (e.g., viral article edits).
It’s a hands-on way to learn Kafka in production-like conditions without building a toy producer-consumer demo.
Why This Project?
Wikimedia’s public edit stream is one of the best free, real-world event sources: high volume (~10–100 edits/sec average, bursts to 1000+), diverse event types (edit, new page, log, categorize), and no auth needed.
I chose it to practice streaming fundamentals outside synthetic data:
- Ingestion under continuous load
- Partitioning and consumer scaling
- Lag monitoring and backpressure handling
- Idempotency and delivery-guarantee semantics
- Replay from offsets after crashes/restarts
- Burst resilience (e.g., during major news events)
It forced me to think in terms of offsets, consumer groups, delivery guarantees, and operational visibility, not just "send message, receive message".
System Architecture
High-level flow:
- Producer — connects to the Wikimedia SSE stream (https://stream.wikimedia.org/v2/stream/recentchange)
- Normalization — parses JSON events, adds metadata (timestamp, topic routing key)
- Kafka Publish — sends to one or more topics (e.g., wiki.raw.edits, wiki.filtered) with proper partitioning
- Consumers — multiple groups/processes read, process, aggregate (counts, anomalies), and monitor health
- Monitoring — track lag, throughput, error rates, partition balance
- Persistence — optional summaries or sampled events for later replay/analysis
Focus: stream reliability over fancy analytics — ensure nothing is lost, duplicated, or silently stalled.
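The normalization step above can be sketched as a small pure function that parses a raw recentchange payload, enriches it, and derives a partition key. This is only a sketch: field names like `meta.id`, `wiki`, `title`, and `type` follow Wikimedia's recentchange event schema, and `normalize_event` is a hypothetical helper, not code from the repository.

```python
import json
import time

def normalize_event(raw: str) -> tuple[str, dict]:
    """Parse a raw SSE 'recentchange' JSON payload and return
    (partition_key, enriched_event). Keying by wiki/project keeps
    all edits for one project on the same partition."""
    event = json.loads(raw)
    enriched = {
        "id": event.get("meta", {}).get("id"),  # unique event id (useful for dedup)
        "wiki": event.get("wiki"),              # e.g. "enwiki"
        "title": event.get("title"),
        "type": event.get("type"),              # edit, new, log, categorize
        "ingested_at": time.time(),             # our own ingestion timestamp
    }
    return enriched["wiki"] or "unknown", enriched

key, evt = normalize_event(
    '{"meta": {"id": "abc-123"}, "wiki": "enwiki", '
    '"title": "Apache Kafka", "type": "edit"}'
)
```

The returned key would then be passed as the Kafka message key so the producer's partitioner routes consistently.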
Key Features
- Live ingestion from Wikimedia public SSE endpoint
- Kafka producer with retries, backoff, and acks=all
- Consumer groups with offset management and graceful shutdown
- Basic metrics: edit rate, lag per partition, consumer health
- Replay support: reset offsets to earliest/latest for testing
- Failure simulation: kill producer/consumer, network delay injection
- Idempotent processing (dedup by event ID where possible)
- Logging and health endpoints for observability
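The idempotent-processing feature can be approximated with a bounded seen-set keyed by each event's unique ID. A minimal sketch (class and parameter names are hypothetical, and the capacity is an arbitrary assumption): replays and retries tend to redeliver recent events, so remembering only the last N IDs is usually enough.

```python
from collections import OrderedDict

class Deduplicator:
    """Remember the last `capacity` event IDs; treat repeats as duplicates.
    Bounded memory, LRU-style eviction of the oldest IDs."""
    def __init__(self, capacity: int = 100_000):
        self.capacity = capacity
        self._seen = OrderedDict()

    def is_new(self, event_id: str) -> bool:
        if event_id in self._seen:
            self._seen.move_to_end(event_id)  # refresh recency on repeat
            return False
        self._seen[event_id] = True
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)    # evict the oldest ID
        return True

dedup = Deduplicator(capacity=3)
results = [dedup.is_new(e) for e in ["a", "b", "a", "c", "d", "b"]]
```

Note the trade-off: once an ID is evicted (like "b" above), a very late duplicate passes through again, which is why downstream processing should still tolerate occasional repeats.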
Technical Insights & Lessons
- Partition strategy matters — keying by wiki/project or page title enables better parallelism
- Lag visibility is everything — without it, silent failures build up
- Idempotency saves you during replays or retries
- Producer tuning (acks, retries, linger.ms) prevents burst amplification
- Consumer rebalancing can cause temporary lag spikes — plan for it
- Schema evolution — even simple JSON needs forward compatibility
- Burst handling — real Wikimedia spikes teach backpressure better than any benchmark
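The producer-tuning lesson maps to a handful of concrete settings. A sketch using kafka-python-style keyword names (confluent-kafka spells these as dotted librdkafka keys such as "linger.ms" instead; check your client's docs before copying values):

```python
# Hypothetical producer settings; keyword names follow kafka-python's
# KafkaProducer constructor, and the specific values are assumptions.
producer_config = {
    "acks": "all",              # wait for all in-sync replicas: no silent loss
    "retries": 5,               # retry transient broker errors
    "retry_backoff_ms": 200,    # back off between retries
    "linger_ms": 50,            # small batching window absorbs edit bursts
    "compression_type": "lz4",  # cheaper on the wire under burst load
}

def is_burst_safe(cfg: dict) -> bool:
    """Sanity-check the two knobs that prevent burst amplification:
    durable acks plus a nonzero batching window."""
    return cfg.get("acks") == "all" and cfg.get("linger_ms", 0) > 0
```

The `linger_ms` window trades a few milliseconds of latency for larger batches, which is exactly the right trade during a viral-article burst.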
This project gave me confidence that I can design, deploy, and operate event pipelines without cloud abstractions.
Future Scope / TODOs
- Real-time anomaly detection (spike alerts, vandalism patterns)
- Multi-topic routing and filtering
- Basic dashboard (Prometheus + Grafana or simple web UI)
- Exactly-once semantics with Kafka transactions
- Geo/visualization of edit activity
- Alerting on high lag or consumer drops
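A first cut at the spike-alert TODO could be a rolling-rate threshold over per-interval edit counts. This is only a sketch; the window size and multiplier are arbitrary assumptions, and the function name is hypothetical:

```python
from collections import deque

def detect_spike(counts, window: int = 5, factor: float = 3.0):
    """Given per-interval edit counts, return the indices where a count
    exceeds `factor` times the trailing `window`-interval average."""
    history = deque(maxlen=window)
    spikes = []
    for i, count in enumerate(counts):
        if len(history) == window:
            avg = sum(history) / window
            if avg > 0 and count > factor * avg:
                spikes.append(i)
        history.append(count)
    return spikes

# A steady ~10 edits/interval with one burst to 60:
spikes = detect_spike([10, 12, 11, 9, 10, 60, 12], window=5, factor=3.0)
```

A real version would consume windowed counts from a Kafka consumer and emit an alert event instead of returning indices, but the thresholding logic is the same.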
Repository
- Code & Docs: github.com/pavandhadge/wiki-monitor-kafka-streaming
- Setup likely includes: Kafka broker (local or Docker), Python producer/consumer scripts, Wikimedia SSE client (sseclient-py or requests), monitoring helpers
If you’re learning Kafka or event streaming, this is a great starter: real data, no mock events, and real failure modes.
“Streaming systems reward clarity: when flow control, lag, and failure paths are explicit, operations become predictable — even when the firehose opens.”