Diving into Distributed Systems: Why We Chose to Build a Distributed Vector Database
Writing this from Panvel, Maharashtra, in what has probably been one of the most intense learning phases of my dev journey.
If you have read my recent posts, you already know I like low-level projects, Linux experiments, and systems work. But this time, my team and I picked something bigger for our major project: a distributed vector database.
We could have played it safe with a normal app. Instead, we picked the option that would force us to deal with the hard stuff: partitioning, replication, consistency, failure recovery, and query performance across nodes.
That one decision changed how I think about software at scale.
Why a Distributed Vector DB?
Vector databases are now core infrastructure for modern AI applications:
- Semantic search
- RAG pipelines
- Recommendations
- Image/audio similarity retrieval
Everyone uses them. We wanted to understand how they actually work under the hood.
Building a distributed version gave us the perfect pressure test:
- Shard data across nodes
- Replicate for fault tolerance
- Decide consistency behavior during writes and reads
- Handle distributed ANN query execution
- Survive node failures and network splits
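To make the first item concrete, here is a minimal sketch of hash-based shard placement, the simplest way to spread vectors across nodes. All names (`shard_for`, `NUM_SHARDS`) are hypothetical, not from our actual codebase:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size

def shard_for(vector_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a vector ID to a shard via a stable hash.

    hashlib is used instead of Python's built-in hash(), which is
    randomized per process and would break routing across nodes.
    """
    digest = hashlib.sha256(vector_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Placement is deterministic: the same ID always routes to the same shard.
print(shard_for("doc-42"))
```

The obvious weakness, which we ran into, is that changing `num_shards` remaps almost every key; consistent hashing exists to soften exactly that.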
It was not just “AI project hype.” It was a systems project disguised as an AI-era problem.
Why This Hit Us at the Right Time
After doing regular app work for a while, you eventually feel the limits of single-machine thinking.
The interesting problems are in distributed behavior:
- What happens when a node dies mid-write?
- How do you rebalance shards without breaking queries?
- Where do you accept eventual consistency, and where do you not?
That is the layer where real engineering trade-offs become visible.
We knew this project would be messy. It absolutely was. But that was exactly the point.
The Learning Rabbit Hole
Once we committed, I went deep into distributed systems fundamentals.
Main resources that had the biggest impact:
- Designing Data-Intensive Applications (DDIA): replication, partitioning, consistency, stream processing
- System design practice resources (Grokking, ByteByteGo, Alex Xu) translated into vector-DB thinking
- Kafka concepts: log-based ingestion, partitioning, replayable pipelines
- Raft paper + thesis: leader election and replicated logs for metadata coordination
We spent a lot of time debating architecture choices instead of jumping straight into code.
Examples:
- Shared-nothing vs shared-storage design
- Where consensus is required and where it is overkill
- Query fan-out and top-k merge strategies
- Recovery behavior after partitions
That design-first discipline saved us from several bad implementation paths.
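One of those debates, the top-k merge, boils down to something like this sketch (assuming smaller distance means a better match, and that each shard has already computed its own local top-k; the function name is mine, not our API):

```python
import heapq

def merge_topk(per_shard_results, k):
    """Merge per-shard (distance, vector_id) lists into a global top-k.

    Each shard returns its local top-k; the coordinator only has to
    pick the k smallest distances across all of them.
    """
    all_hits = (hit for shard in per_shard_results for hit in shard)
    return heapq.nsmallest(k, all_hits)

# Two shards, each returning its local top-2 by distance.
shard_a = [(0.12, "a1"), (0.35, "a2")]
shard_b = [(0.08, "b1"), (0.40, "b2")]
print(merge_topk([shard_a, shard_b], 2))  # [(0.08, 'b1'), (0.12, 'a1')]
```

The key correctness argument: as long as every shard returns at least k candidates, the global top-k is guaranteed to be among them, so the coordinator never needs the full result sets.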
What We Built So Far
This is still a prototype, but it is already a strong learning artifact:
- Single-node vector store baseline
- ANN indexing layer (HNSW/IVF-backed experimentation)
- Sharding strategy for distributed placement
- Leader-follower replication model for key metadata paths
- Distributed query flow: fan-out to shards and merge top-k results
- Queue-based ingestion pipeline model
- Fault injection tests (node kill, delayed responses, partition simulations)
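The queue-based ingestion model is roughly this shape, here sketched with Python's standard `queue` as a stand-in for a real log-based pipeline like Kafka (names and structure are illustrative, not our actual code):

```python
import queue
import threading

def ingest_worker(q, store):
    """Consume (vector_id, embedding) items and apply them to the store.

    Decoupling writes behind a queue means the write path stays fast
    and the pipeline can be replayed or drained independently.
    """
    while True:
        item = q.get()
        if item is None:  # sentinel: shut down cleanly
            break
        vector_id, embedding = item
        store[vector_id] = embedding
        q.task_done()

q = queue.Queue()
store = {}
worker = threading.Thread(target=ingest_worker, args=(q, store))
worker.start()
q.put(("doc-1", [0.1, 0.2]))
q.put(("doc-2", [0.3, 0.4]))
q.put(None)
worker.join()
print(sorted(store))  # ['doc-1', 'doc-2']
```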
Seeing queries still return valid results after taking down a node was one of those big “okay, this is real” moments.
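That moment is easy to reproduce as a toy: give a shard a leader and a follower with the same data, kill the leader, and watch the read path fall through to the replica. Everything below is a hypothetical sketch, not our implementation:

```python
class Node:
    def __init__(self, name, data):
        self.name, self.data, self.alive = name, data, True

    def search(self, k):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return sorted(self.data)[:k]  # (distance, id) pairs, best first

def query_shard(replicas, k):
    """Try each replica in order; a dead node is skipped, not a failure."""
    for node in replicas:
        try:
            return node.search(k)
        except ConnectionError:
            continue
    raise RuntimeError("all replicas for this shard are down")

# Leader and follower hold the same shard data (replication assumed done).
leader = Node("shard0-leader", [(0.3, "x"), (0.1, "y")])
follower = Node("shard0-follower", [(0.3, "x"), (0.1, "y")])
leader.alive = False  # fault injection: kill the leader
print(query_shard([leader, follower], 1))  # [(0.1, 'y')]
```

The real versions of these tests are messier (timeouts instead of clean exceptions, stale followers), but the principle is the same: failure handling lives in the read path, not in a separate recovery mode.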
What This Project Taught Me
Biggest lessons:
- Failures are normal, not edge cases
- Every consistency guarantee has a performance cost
- Good distributed systems are mostly about trade-offs, not perfect answers
- Observability and testing strategy matter as much as core logic
This project made me more careful. I do less hand-wavy thinking now when someone says “just scale it horizontally.”
Final Thoughts
If you are in that phase where normal projects feel too predictable, pick one distributed project and commit to it.
Start small if needed: replicate a key-value store, shard a tiny dataset, implement a simplified leader election flow. You do not need a production-grade cluster on day one. You need exposure to failure and distributed trade-offs.
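The leader-election starter really can be this small. Here is a deliberately simplified version (lowest alive node ID wins; real Raft adds terms, votes, and log comparisons, so treat this purely as the first exercise, not as Raft):

```python
def elect_leader(alive_node_ids):
    """Simplified election: the lowest alive node ID becomes leader.

    This is the toy rule; Raft replaces it with randomized timeouts,
    terms, and majority votes, but the failover shape is the same.
    """
    if not alive_node_ids:
        raise RuntimeError("no live nodes to elect")
    return min(alive_node_ids)

nodes = {1, 2, 3}
assert elect_leader(nodes) == 1
nodes.discard(1)                 # leader fails
assert elect_leader(nodes) == 2  # cluster converges on a new leader
```

Once this feels trivial, adding heartbeats and a term counter on top of it is a natural next step toward the real algorithm.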
Our project repo is private for now because of college constraints, but I plan to publish a cleaned-up public version later with notes on architecture, sharding decisions, and what failed before what worked.
Keep building. Keep breaking. Keep learning.