
Introduction
The real competition is not pgvector versus Pinecone on a recall benchmark. It is pgvector versus the entire distributed system you build around a dedicated store: the dual-write pipeline, the consistency budget, the embedding-freshness SLA, and the operational surface area of two production systems instead of one. Engineers who run that comparison honestly keep finding that the break-even point is much further out than their initial architecture review assumed.
The Architecture Tax Nobody Invoices
The benchmarks are honest. What's dishonest is what they omit: the system you must build around a dedicated vector store to make it production-safe. That system has four layers, and each one is a failure domain you don't have with pgvector.
Layer 1: The Dual-Write Pipeline
Any architecture where vectors live in a separate store from the rows they describe requires a synchronization pipeline. At Retool's scale (internal tooling platform, hundreds of thousands of user-generated queries), this pipeline processes schema changes, row-level updates, and soft deletes — all of which must be reflected in the vector store before search results are valid.
PostgreSQL with pgvector eliminates this failure mode structurally. The vector and the row it describes commit together in the same ACID transaction, or neither commits. There is no T=0 to T=1.1s window because there is no second system.
Layer 2: The WAL Amplification Problem
This is the failure mode that hits bulk-embedding pipelines first, and almost nobody documents it. When you update an embedding column — 1536 float32 values = 6,144 bytes — PostgreSQL writes the entire new tuple to the WAL. For a 10-million-document initial embedding job running 5,000 upserts/second, the embedding payload alone generates approximately 31 MB/s of WAL.
Replicated deployments (any RDS Multi-AZ, any Patroni cluster) feel this most acutely: every replica must apply that same ~31 MB/s of WAL, doubling cluster-wide write volume, and replica lag during embedding runs causes read-your-writes violations if your application routes reads to replicas. The fix is not just UNLOGGED tables — it is also ensuring wal_compression = lz4 is set, which compresses full-page images in the WAL and can substantially reduce WAL volume during bulk embedding runs.
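The arithmetic above can be sketched as a quick calculator (illustrative Python; this counts only the embedding payload itself, so real WAL volume with tuple headers, index WAL, and full-page images will be higher):

```python
def wal_rate_mb_per_sec(dims: int, upserts_per_sec: int,
                        bytes_per_value: int = 4,
                        replicated: bool = False) -> float:
    """Rough lower bound on WAL volume from embedding upserts.

    Counts only the embedding payload; tuple headers, index WAL,
    and full-page images add more in practice.
    """
    rate = dims * bytes_per_value * upserts_per_sec  # bytes/sec
    if replicated:
        rate *= 2  # primary WAL plus the copy every replica must apply
    return rate / 1_000_000

# 1536-dim float32 embeddings at 5,000 upserts/sec
print(wal_rate_mb_per_sec(1536, 5000))                   # ~30.7 MB/s raw
print(wal_rate_mb_per_sec(1536, 5000, replicated=True))  # ~61.4 MB/s with a replica
```

Running this before an embedding job tells you whether your WAL volume, archiver, and replicas can keep up at your planned upsert rate.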
Layer 3: The Connection Overhead Cliff
Vector search queries are CPU-intensive at the PostgreSQL process level, and each query occupies a backend process for its full duration. Each HNSW graph traversal at ef_search=100 on a 10M-vector corpus consumes approximately 8-15ms of CPU time on a modern core, so an unpooled connection storm translates directly into CPU saturation.
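At those per-query CPU costs, sustainable throughput per machine is easy to bound in advance (a back-of-envelope sketch; the 8-15 ms figures come from the text above, the rest is arithmetic):

```python
def max_qps(cores: int, cpu_ms_per_query: float) -> float:
    """Upper bound on vector-search QPS if every core does nothing else."""
    return cores * 1000.0 / cpu_ms_per_query

# 16 cores at 8-15 ms of CPU per HNSW traversal (ef_search=100, 10M vectors)
print(max_qps(16, 8.0))   # ~2000 QPS best case
print(max_qps(16, 15.0))  # ~1067 QPS worst case
```

If your target QPS exceeds this bound, no amount of connection pooling will save you; the cliff is the core count.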
-- WAL amplification calculation:
-- 1536 dimensions × 4 bytes × 5000 upserts/sec
-- = 30,720,000 bytes/sec = ~30 MB/sec raw
-- With replication = ~60 MB/sec total
-- Enable WAL compression for embeddings
ALTER SYSTEM SET wal_compression = lz4;
-- Use UNLOGGED for bulk loads
CREATE UNLOGGED TABLE embeddings_bulk (
id bigint GENERATED ALWAYS AS IDENTITY,
document_id bigint NOT NULL,
embedding vector(1536) NOT NULL
);
-- After bulk load, convert to LOGGED
ALTER TABLE embeddings_bulk SET LOGGED;What the Query Planner Actually Does
The most widely repeated claim about pgvector — that it "pushes filters into the HNSW scan" — is mechanically wrong in a way that changes how you should design queries.
The Two-Phase Bitmap Intersection
PostgreSQL's HNSW access method has no predicate awareness during graph traversal. The HNSW graph is built purely on L2/cosine distance; it cannot evaluate a WHERE clause inside a graph edge traversal. What the planner constructs instead is a two-phase bitmap intersection:
- Phase 1 — Bitmap Build: The planner issues Bitmap Index Scans on conventional indexes (B-tree on tenant_id, GIN on tsvector). Each scan returns a TID bitmap.
- Phase 2 — HNSW Scan + Recheck: The HNSW AM traverses the graph and returns a ranked list of TIDs ordered by vector distance; candidates that fall outside the Phase 1 bitmap are discarded during the heap recheck, so only filter-matching rows survive to the LIMIT.
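A toy model of the two phases (pure Python, purely illustrative; PostgreSQL operates on TID bitmaps and heap pages, not Python sets):

```python
def two_phase_topk(ranked_tids, filter_bitmap, k):
    """Phase 2 consumes Phase 1's output: walk the distance-ranked
    TIDs and keep only those present in the filter bitmap, until
    k results survive. Distance order is preserved."""
    hits = []
    for tid in ranked_tids:          # ordered by vector distance
        if tid in filter_bitmap:     # recheck against the Phase 1 bitmap
            hits.append(tid)
            if len(hits) == k:
                break
    return hits

# Phase 1: a B-tree scan on tenant_id yields a TID bitmap
bitmap = {2, 3, 5, 8, 13}
# Phase 2: HNSW returns TIDs ranked by distance to the query vector
ranked = [7, 3, 11, 5, 2, 9, 13]
print(two_phase_topk(ranked, bitmap, 3))  # [3, 5, 2]
```

Note what this model makes obvious: when the filter is very sparse, most of the ranked list is thrown away, and the HNSW scan must traverse far more of the graph to fill k slots.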
-- Two-phase query execution
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, document_id,
       1 - (embedding <=> $query) AS similarity
FROM embeddings
WHERE tenant_id = 5
  AND status = 'active'
ORDER BY embedding <=> $query
LIMIT 20;

-- Expected plan (abridged):
-- Bitmap Heap Scan on embeddings
--   Recheck Cond: (embedding <=> $query) < 0.5
--   Filter: (tenant_id = 5) AND (status = 'active')
--   -> BitmapAnd
--        -> Bitmap Index Scan on embeddings_tenant_idx
--        -> Bitmap Index Scan on embeddings_hnsw_idx
When the Planner Abandons the HNSW Index
PostgreSQL's cost model for HNSW scans uses a generic access method cost estimate that doesn't account for graph structure. On tables under ~50,000 rows, or immediately after a large bulk insert before ANALYZE runs, the planner systematically underestimates the benefit of an HNSW scan and falls back to a Seq Scan. Run ANALYZE explicitly after bulk loads so the planner has fresh statistics before serving traffic.
The Memory Residency Constraint
This is the constraint that determines whether pgvector is viable at your scale:
The HNSW graph is stored as ordinary PostgreSQL heap and index pages on disk. During graph traversal, each hop to a neighboring node requires reading that node's page. If the page is in shared_buffers or the OS page cache: ~0.1μs per hop. If it requires a disk read: ~0.1ms per hop (NVMe) to 10ms (spinning disk).
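The working-set formula is simple enough to check before you provision hardware (illustrative Python; M=16 matches pgvector's default hnsw m, and the graph model is the rough one used in this article):

```python
def hnsw_working_set_gb(n_vectors: int, dims: int,
                        bytes_per_value: int = 4, m: int = 16) -> float:
    """Approximate RAM needed to keep the vectors plus the HNSW
    graph hot: N*dims*bytes for vectors, N*M*2*4 bytes for edges."""
    vectors = n_vectors * dims * bytes_per_value
    graph = n_vectors * m * 2 * 4
    return (vectors + graph) / 1e9

print(hnsw_working_set_gb(10_000_000, 1536))                     # ~62.7 GB float32
print(hnsw_working_set_gb(10_000_000, 1536, bytes_per_value=2))  # ~32.0 GB half precision
```

If the result exceeds the RAM you can afford, every graph hop risks becoming an NVMe read and your p99 latency inherits the ~1000x penalty described above.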
-- Working Set Size Calculation
-- For 10M vectors with 1536 dimensions (float32):
-- Graph storage: N × M × 2 × 4 bytes
-- = 10M × 16 × 2 × 4 = ~1.28 GB
-- Vectors: N × dim × 4 bytes
-- = 10M × 1536 × 4 = ~61.4 GB
-- Total: ~62.7 GB
-- Use half precision (pgvector 0.7+)
-- halfvec stores 2 bytes per dimension instead of 4,
-- cutting the hot working set to roughly 32 GB for 10M vectors

-- Create a half-precision HNSW index over the float32 column
CREATE INDEX ON embeddings
USING hnsw ((embedding::halfvec(1536)) halfvec_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Queries must use the same cast to hit this index:
--   ORDER BY embedding::halfvec(1536) <=> $query
VACUUM and HNSW Graph Health
PostgreSQL's MVCC model marks updated and deleted heap tuples as dead but leaves them in place until VACUUM reclaims their pages. The interaction with HNSW indexes is less well-documented.
When heap tuples die, their corresponding HNSW nodes become phantom entries — they remain in neighbor lists and graph edges but represent content that no longer exists. Traversals that route through phantom nodes waste steps.
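Autovacuum fires only once dead tuples exceed threshold + scale_factor × reltuples (PostgreSQL's documented trigger formula), so the default 20% scale factor lets millions of phantom-backed tuples accumulate on a large embedding table before any cleanup runs:

```python
def autovacuum_trigger(reltuples: int, scale_factor: float = 0.2,
                       threshold: int = 50) -> int:
    """Dead tuples required before autovacuum runs.
    Defaults match stock PostgreSQL settings."""
    return int(threshold + scale_factor * reltuples)

# 50M-row embedding table: stock defaults vs. tuned settings
print(autovacuum_trigger(50_000_000))              # 10,000,050 dead tuples
print(autovacuum_trigger(50_000_000, 0.01, 1000))  # 501,000 dead tuples
```

A 20x reduction in the trigger point is the difference between vacuuming after 10 million phantom nodes and vacuuming after half a million.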
-- Reduce vacuum threshold for embedding tables
ALTER TABLE embeddings SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold = 1000
);

-- Monitor table and index bloat as a proxy for phantom-node buildup
SELECT
    count(*) AS total_vectors,
    pg_size_pretty(pg_table_size('embeddings')) AS table_size,
    pg_size_pretty(pg_indexes_size('embeddings')) AS index_size
FROM embeddings;

-- If recall drops, rebuild the graph
REINDEX INDEX embeddings_hnsw_idx;
The Production-Grade Schema
Most pgvector schemas in documentation are technically correct and operationally naïve. They work at 100K vectors. The following schema is designed to survive at 50M vectors, high write rates, and multi-tenant isolation requirements.
-- Production-grade schema for 50M+ vectors
CREATE TABLE embeddings (
    id BIGSERIAL PRIMARY KEY,
    tenant_id BIGINT NOT NULL,
    document_id BIGINT NOT NULL,
    chunk_text TEXT,
    embedding_256 vector(256),    -- MRL head for coarse ANN
    embedding_full vector(1536),  -- full precision for reranking
    status TEXT DEFAULT 'active',
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Partial per-tenant ANN indexes keep each IVFFlat list small
-- (repeat per large tenant; tenant_id = 5 shown as an example)
CREATE INDEX idx_embeddings_tenant5_256
ON embeddings
USING ivfflat (embedding_256 vector_cosine_ops)
WITH (lists = 100)
WHERE tenant_id = 5;
-- HNSW for global search (if needed)
CREATE INDEX idx_embeddings_hnsw
ON embeddings USING hnsw (embedding_full vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Two-stage retrieval pattern:
-- coarse ANN on the 256-dim MRL head, exact rerank on the full vector
SELECT e.id, e.document_id,
       1 - (e.embedding_full <=> $query_full) AS rerank_score
FROM (
    SELECT id, document_id, embedding_full
    FROM embeddings
    WHERE tenant_id = $tenant
    ORDER BY embedding_256 <=> $query_mrl
    LIMIT 200
) e
ORDER BY e.embedding_full <=> $query_full
LIMIT 20;
The Honest Comparison Matrix
The following table summarizes the break-even thresholds with explicit, measurable signals on each axis:
| Factor | pgvector Wins | Dedicated Store Wins |
|---|---|---|
| Working set in RAM | Less than 100 GB | Greater than 200 GB |
| Write latency SLA | Less than 100 ms | Greater than 500 ms acceptable |
| Filter selectivity | Greater than 1% (high) | Less than 0.1% (very sparse) |
| Vector dimension | Less than 2048 | Greater than 4096 |
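The matrix collapses into a crude screening function (illustrative only; the thresholds are the ones from the table, and values in the gray zones between the columns need real benchmarks, not a boolean):

```python
def pgvector_viable(working_set_gb: float, filter_selectivity: float,
                    dims: int) -> bool:
    """Rough screen using the matrix thresholds above. A False here
    does not mandate a dedicated store; it mandates a benchmark."""
    return (working_set_gb < 100          # fits comfortably in RAM
            and filter_selectivity > 0.01  # filters are not ultra-sparse
            and dims < 2048)               # within pgvector's sweet spot

print(pgvector_viable(62.7, 0.05, 1536))     # True: stay on PostgreSQL
print(pgvector_viable(250.0, 0.0005, 4096))  # False: benchmark a dedicated store
```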
Key Takeaways
- pgvector's filter speedup is a bitmap intersection, not predicate pushdown. Design queries with high-selectivity B-tree/GIN predicates to maximize this effect.
- The scaling limit is RAM, and the formula is knowable in advance. Calculate (N × dim × 4) + (N × M × 2 × 4) bytes before building your HNSW index.
- WAL amplification is the first operational failure mode in bulk-embedding pipelines. Use wal_compression = lz4 and UNLOGGED tables for initial loads.
- HNSW phantom node accumulation causes recall regression. Reduce autovacuum_vacuum_scale_factor to 0.01 on embedding tables.
- The break-even for dedicated systems is deterministic, not intuitive. Run the working-set formula and measure your sustained QPS.
Conclusion
PostgreSQL with pgvector offers a compelling alternative to dedicated vector databases for most production workloads. The architectural simplicity of a single system — with ACID guarantees, unified queries, and operational familiarity — compounds as your application complexity grows. Run the numbers honestly, validate against your data distribution, and you may find that pgvector is exactly the AI infrastructure you need.
ByteHubble Engineering
Database Intelligence Platform
Helping teams build better databases with AI-powered insights.