WikiMonitor - Wikimedia Edit Streaming with Kafka
Live Edit Events → Kafka → Reliable Monitoring – No Batch Shortcuts
WikiMonitor is a stream-oriented project that ingests the live firehose of Wikimedia edit events (Wikipedia, Wikimedia Commons, etc.) and routes them through Kafka for real-time processing and health monitoring.
Instead of one-off API calls or batch jobs, everything is event-driven: continuous ingestion, reliable transport, idempotent consumption, and visibility into lag/throughput under natural burst patterns (e.g., viral article edits).
It’s a hands-on way to learn Kafka in production-like conditions without building a toy producer-consumer demo.
Why This Project?
Wikimedia’s public edit stream is one of the best free, real-world event sources: high volume (~10–100 edits/sec average, bursts to 1000+), diverse event types (edit, new page, log, categorize), and no auth needed.
I chose it to practice streaming fundamentals outside synthetic data:
- Ingestion under continuous load
- Partitioning and consumer scaling
- Lag monitoring and backpressure handling
- Idempotency and delivery-guarantee semantics
- Replay from offsets after crashes/restarts
- Burst resilience (e.g., during major news events)
It forced me to think in terms of offsets, consumer groups, delivery guarantees, and operational visibility, not just "send message, receive message".
System Architecture
High-level flow:
- Producer — connects to the Wikimedia SSE stream (https://stream.wikimedia.org/v2/stream/recentchange)
- Normalization — parses JSON events, adds metadata (timestamp, topic routing key)
- Kafka Publish — sends to one or more topics (e.g., wiki.raw.edits, wiki.filtered) with proper partitioning
- Consumers — multiple groups/processes read, process, aggregate (counts, anomalies), and monitor health
- Monitoring — track lag, throughput, error rates, partition balance
- Persistence — optional summaries or sampled events for later replay/analysis
Focus: stream reliability over fancy analytics — ensure nothing is lost, duplicated, or silently stalled.
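The normalization step above can be sketched as a small pure function that parses a raw recentchange payload, enriches it, and derives a partition key. This is only a sketch: field names like `meta.id`, `wiki`, `title`, and `type` follow Wikimedia's recentchange event schema, and `normalize_event` is a hypothetical helper, not code from the repository.

```python
import json
import time

def normalize_event(raw: str) -> tuple[str, dict]:
    """Parse a raw SSE 'recentchange' JSON payload and return
    (partition_key, enriched_event). Keying by wiki/project keeps
    all edits for one project on the same partition."""
    event = json.loads(raw)
    enriched = {
        "id": event.get("meta", {}).get("id"),  # unique event id (useful for dedup)
        "wiki": event.get("wiki"),              # e.g. "enwiki"
        "title": event.get("title"),
        "type": event.get("type"),              # edit, new, log, categorize
        "ingested_at": time.time(),             # our own ingestion timestamp
    }
    return enriched["wiki"] or "unknown", enriched

key, evt = normalize_event(
    '{"meta": {"id": "abc-123"}, "wiki": "enwiki", '
    '"title": "Apache Kafka", "type": "edit"}'
)
```

The returned key would then be passed as the Kafka message key so the producer's partitioner routes consistently.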
Key Features
- Live ingestion from Wikimedia public SSE endpoint
- Kafka producer with retries, backoff, and acks=all
- Consumer groups with offset management and graceful shutdown
- Basic metrics: edit rate, lag per partition, consumer health
- Replay support: reset offsets to earliest/latest for testing
- Failure simulation: kill producer/consumer, network delay injection
- Idempotent processing (dedup by event ID where possible)
- Logging and health endpoints for observability
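The idempotent-processing feature can be approximated with a bounded seen-set keyed by each event's unique ID. A minimal sketch (class and parameter names are hypothetical, and the capacity is an arbitrary assumption): replays and retries tend to redeliver recent events, so remembering only the last N IDs is usually enough.

```python
from collections import OrderedDict

class Deduplicator:
    """Remember the last `capacity` event IDs; treat repeats as duplicates.
    Bounded memory, LRU-style eviction of the oldest IDs."""
    def __init__(self, capacity: int = 100_000):
        self.capacity = capacity
        self._seen = OrderedDict()

    def is_new(self, event_id: str) -> bool:
        if event_id in self._seen:
            self._seen.move_to_end(event_id)  # refresh recency on repeat
            return False
        self._seen[event_id] = True
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)    # evict the oldest ID
        return True

dedup = Deduplicator(capacity=3)
results = [dedup.is_new(e) for e in ["a", "b", "a", "c", "d", "b"]]
```

Note the trade-off: once an ID is evicted (like "b" above), a very late duplicate passes through again, which is why downstream processing should still tolerate occasional repeats.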
Technical Insights & Lessons
- Partition strategy matters — keying by wiki/project or page title enables better parallelism
- Lag visibility is everything — without it, silent failures build up
- Idempotency saves you during replays or retries
- Producer tuning (acks, retries, linger.ms) prevents burst amplification
- Consumer rebalancing can cause temporary lag spikes — plan for it
- Schema evolution — even simple JSON needs forward compatibility
- Burst handling — real Wikimedia spikes teach backpressure better than any benchmark
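The producer-tuning lesson maps to a handful of concrete settings. A sketch using kafka-python-style keyword names (confluent-kafka spells these as dotted librdkafka keys such as "linger.ms" instead; check your client's docs before copying values):

```python
# Hypothetical producer settings; keyword names follow kafka-python's
# KafkaProducer constructor, and the specific values are assumptions.
producer_config = {
    "acks": "all",              # wait for all in-sync replicas: no silent loss
    "retries": 5,               # retry transient broker errors
    "retry_backoff_ms": 200,    # back off between retries
    "linger_ms": 50,            # small batching window absorbs edit bursts
    "compression_type": "lz4",  # cheaper on the wire under burst load
}

def is_burst_safe(cfg: dict) -> bool:
    """Sanity-check the two knobs that prevent burst amplification:
    durable acks plus a nonzero batching window."""
    return cfg.get("acks") == "all" and cfg.get("linger_ms", 0) > 0
```

The `linger_ms` window trades a few milliseconds of latency for larger batches, which is exactly the right trade during a viral-article burst.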
This project gave me confidence that I can design, deploy, and operate event pipelines without cloud abstractions.
Future Scope / TODOs
- Real-time anomaly detection (spike alerts, vandalism patterns)
- Multi-topic routing and filtering
- Basic dashboard (Prometheus + Grafana or simple web UI)
- Exactly-once semantics with Kafka transactions
- Geo/visualization of edit activity
- Alerting on high lag or consumer drops
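A first cut at the spike-alert TODO could be a rolling-rate threshold over per-interval edit counts. This is only a sketch; the window size and multiplier are arbitrary assumptions, and the function name is hypothetical:

```python
from collections import deque

def detect_spike(counts, window: int = 5, factor: float = 3.0):
    """Given per-interval edit counts, return the indices where a count
    exceeds `factor` times the trailing `window`-interval average."""
    history = deque(maxlen=window)
    spikes = []
    for i, count in enumerate(counts):
        if len(history) == window:
            avg = sum(history) / window
            if avg > 0 and count > factor * avg:
                spikes.append(i)
        history.append(count)
    return spikes

# A steady ~10 edits/interval with one burst to 60:
spikes = detect_spike([10, 12, 11, 9, 10, 60, 12], window=5, factor=3.0)
```

A real version would consume windowed counts from a Kafka consumer and emit an alert event instead of returning indices, but the thresholding logic is the same.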
Repository
- Code & Docs: github.com/pavandhadge/wiki-monitor-kafka-streaming
- Setup likely includes: Kafka broker (local or Docker), Python producer/consumer scripts, Wikimedia SSE client (sseclient-py or requests), monitoring helpers
If you’re learning Kafka or event streaming, this is a great starter: real data, no mock events, and real failure modes.
“Streaming systems reward clarity: when flow control, lag, and failure paths are explicit, operations become predictable — even when the firehose opens.”