Diving into Distributed Systems: Why We Chose to Build a Distributed Vector Database
Writing this from Panvel, Maharashtra, in what has probably been one of the most intense learning phases of my dev journey.
If you have read my recent posts, you already know I like low-level projects, Linux experiments, and systems work. But this time, my team and I picked something bigger for our major project: a distributed vector database.
We could have played it safe with a normal app. Instead, we picked the option that would force us to deal with the hard stuff: partitioning, replication, consistency, failure recovery, and query performance across nodes.
That one decision changed how I think about software at scale.
Why a Distributed Vector DB?
Vector databases are now core infrastructure for modern AI applications:
- Semantic search
- RAG pipelines
- Recommendations
- Image/audio similarity retrieval
Everyone uses them. We wanted to understand how they actually work under the hood.
Building a distributed version gave us the perfect pressure test:
- Shard data across nodes
- Replicate for fault tolerance
- Decide consistency behavior during writes and reads
- Handle distributed ANN query execution
- Survive node failures and network splits
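To make the first item concrete, here is a minimal sketch of hash-based shard placement, the simplest way to spread vectors across nodes. All names (`shard_for`, `NUM_SHARDS`) are hypothetical, not from our actual codebase:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size

def shard_for(vector_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a vector ID to a shard via a stable hash.

    hashlib is used instead of Python's built-in hash(), which is
    randomized per process and would break routing across nodes.
    """
    digest = hashlib.sha256(vector_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Placement is deterministic: the same ID always routes to the same shard.
print(shard_for("doc-42"))
```

The obvious weakness, which we ran into, is that changing `num_shards` remaps almost every key; consistent hashing exists to soften exactly that.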
It was not just “AI project hype.” It was a systems project disguised as an AI-era problem.
Why This Hit Us at the Right Time
After doing regular app work for a while, you eventually feel the limits of single-machine thinking.
The interesting problems are in distributed behavior:
- What happens when a node dies mid-write?
- How do you rebalance shards without breaking queries?
- Where do you accept eventual consistency, and where do you not?
That is the layer where real engineering trade-offs become visible.
We knew this project would be messy. It absolutely was. But that was exactly the point.
The Learning Rabbit Hole
Once we committed, I went deep into distributed systems fundamentals.
Main resources that had the biggest impact:
- Designing Data-Intensive Applications (DDIA): replication, partitioning, consistency, stream processing
- System design practice resources (Grokking, ByteByteGo, Alex Xu) translated into vector-DB thinking
- Kafka concepts: log-based ingestion, partitioning, replayable pipelines
- Raft paper + thesis: leader election and replicated logs for metadata coordination
We spent a lot of time debating architecture choices instead of jumping straight into code.
Examples:
- Shared-nothing vs shared-storage design
- Where consensus is required and where it is overkill
- Query fan-out and top-k merge strategies
- Recovery behavior after partitions
That design-first discipline saved us from several bad implementation paths.
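One of those debates, the top-k merge, boils down to something like this sketch (assuming smaller distance means a better match, and that each shard has already computed its own local top-k; the function name is mine, not our API):

```python
import heapq

def merge_topk(per_shard_results, k):
    """Merge per-shard (distance, vector_id) lists into a global top-k.

    Each shard returns its local top-k; the coordinator only has to
    pick the k smallest distances across all of them.
    """
    all_hits = (hit for shard in per_shard_results for hit in shard)
    return heapq.nsmallest(k, all_hits)

# Two shards, each returning its local top-2 by distance.
shard_a = [(0.12, "a1"), (0.35, "a2")]
shard_b = [(0.08, "b1"), (0.40, "b2")]
print(merge_topk([shard_a, shard_b], 2))  # [(0.08, 'b1'), (0.12, 'a1')]
```

The key correctness argument: as long as every shard returns at least k candidates, the global top-k is guaranteed to be among them, so the coordinator never needs the full result sets.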
What We Built So Far
This is still a prototype, but it is already a strong learning artifact:
- Single-node vector store baseline
- ANN indexing layer (HNSW/IVF-backed experimentation)
- Sharding strategy for distributed placement
- Leader-follower replication model for key metadata paths
- Distributed query flow: fan-out to shards and merge top-k results
- Queue-based ingestion pipeline model
- Fault injection tests (node kill, delayed responses, partition simulations)
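The queue-based ingestion model is roughly this shape, here sketched with Python's standard `queue` as a stand-in for a real log-based pipeline like Kafka (names and structure are illustrative, not our actual code):

```python
import queue
import threading

def ingest_worker(q, store):
    """Consume (vector_id, embedding) items and apply them to the store.

    Decoupling writes behind a queue means the write path stays fast
    and the pipeline can be replayed or drained independently.
    """
    while True:
        item = q.get()
        if item is None:  # sentinel: shut down cleanly
            break
        vector_id, embedding = item
        store[vector_id] = embedding
        q.task_done()

q = queue.Queue()
store = {}
worker = threading.Thread(target=ingest_worker, args=(q, store))
worker.start()
q.put(("doc-1", [0.1, 0.2]))
q.put(("doc-2", [0.3, 0.4]))
q.put(None)
worker.join()
print(sorted(store))  # ['doc-1', 'doc-2']
```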
Seeing queries still return valid results after taking down a node was one of those big “okay, this is real” moments.
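That moment is easy to reproduce as a toy: give a shard a leader and a follower with the same data, kill the leader, and watch the read path fall through to the replica. Everything below is a hypothetical sketch, not our implementation:

```python
class Node:
    def __init__(self, name, data):
        self.name, self.data, self.alive = name, data, True

    def search(self, k):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return sorted(self.data)[:k]  # (distance, id) pairs, best first

def query_shard(replicas, k):
    """Try each replica in order; a dead node is skipped, not a failure."""
    for node in replicas:
        try:
            return node.search(k)
        except ConnectionError:
            continue
    raise RuntimeError("all replicas for this shard are down")

# Leader and follower hold the same shard data (replication assumed done).
leader = Node("shard0-leader", [(0.3, "x"), (0.1, "y")])
follower = Node("shard0-follower", [(0.3, "x"), (0.1, "y")])
leader.alive = False  # fault injection: kill the leader
print(query_shard([leader, follower], 1))  # [(0.1, 'y')]
```

The real versions of these tests are messier (timeouts instead of clean exceptions, stale followers), but the principle is the same: failure handling lives in the read path, not in a separate recovery mode.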
What This Project Taught Me
Biggest lessons:
- Failures are normal, not edge cases
- Every consistency guarantee has a performance cost
- Good distributed systems are mostly about trade-offs, not perfect answers
- Observability and testing strategy matter as much as core logic
This project made me more careful. I do less hand-wavy thinking now when someone says “just scale it horizontally.”
Final Thoughts
If you are in that phase where normal projects feel too predictable, pick one distributed project and commit to it.
Start small if needed: replicate a key-value store, shard a tiny dataset, implement a simplified leader election flow. You do not need a production-grade cluster on day one. You need exposure to failure and distributed trade-offs.
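The leader-election starter really can be this small. Here is a deliberately simplified version (lowest alive node ID wins; real Raft adds terms, votes, and log comparisons, so treat this purely as the first exercise, not as Raft):

```python
def elect_leader(alive_node_ids):
    """Simplified election: the lowest alive node ID becomes leader.

    This is the toy rule; Raft replaces it with randomized timeouts,
    terms, and majority votes, but the failover shape is the same.
    """
    if not alive_node_ids:
        raise RuntimeError("no live nodes to elect")
    return min(alive_node_ids)

nodes = {1, 2, 3}
assert elect_leader(nodes) == 1
nodes.discard(1)                 # leader fails
assert elect_leader(nodes) == 2  # cluster converges on a new leader
```

Once this feels trivial, adding heartbeats and a term counter on top of it is a natural next step toward the real algorithm.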
Our project repo is private for now because of college constraints, but I plan to publish a cleaned-up public version later with notes on architecture, sharding decisions, and what failed before what worked.
Keep building. Keep breaking. Keep learning.