Skip to content
All articles
Articlepostgrespgvectorragai

pgvector and Postgres for RAG: A 2026 Production Setup

How to build a production RAG pipeline on Postgres with pgvector in 2026: HNSW indexes, embedding hygiene, chunking strategies, hybrid search, and the failure modes that bite teams in production.

18 min read

Retrieval-Augmented Generation went from research demo to default architecture between 2023 and 2026. The question shifted from "does it work?" to "what's the cheapest, simplest, most observable way to ship it?"

For most teams in 2026, the answer is: Postgres with pgvector and HNSW indexes. This is what we've learned running RAG pipelines on Supabase and Neon over the last 18 months.

Why Postgres + pgvector won 2025

The dedicated vector databases (Pinecone, Qdrant, Weaviate, Milvus) are all good products. But for typical RAG workloads, millions, not billions, of vectors; sub-200ms query budgets; need to join against your business data, the calculus tipped toward Postgres for three reasons:

  1. One system to operate. Your vectors live next to your documents and users tables. No data sync. No second backup story.
  2. HNSW caught up. pgvector's HNSW implementation (0.5.0, late 2023) and the subsequent quantisation work in 0.7/0.8 made the latency gap with the specialised vector DBs irrelevant for most workloads.
  3. Joins are the killer feature. "Give me the top 5 most-similar chunks that this user is allowed to see and that haven't been deleted" is one SQL query in Postgres. In a vector-DB-first architecture, it's a careful dance.

The schema we actually use

A pattern that has worked across multiple production deploys:

rag-schema.sqlsql
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trgm;  -- for hybrid search later.

CREATE TABLE documents (
  id           uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  source_uri   text NOT NULL,
  title        text NOT NULL,
  tenant_id    uuid NOT NULL,
  -- metadata about provenance, freshness etc.
  metadata     jsonb NOT NULL DEFAULT '{}'::jsonb,
  created_at   timestamptz NOT NULL DEFAULT now(),
  updated_at   timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE chunks (
  id           uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id  uuid NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
  tenant_id    uuid NOT NULL,  -- denormalised for RLS + fast filtering.
  ordinal      int  NOT NULL,  -- position within the document.
  content      text NOT NULL,
  -- 1536-d for OpenAI embeddings, adjust for whatever model you use.
  embedding    vector(1536) NOT NULL,
  token_count  int NOT NULL,
  -- which embedding model produced this. Lets you migrate models gradually.
  model        text NOT NULL,
  created_at   timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX chunks_doc_idx     ON chunks (document_id);
CREATE INDEX chunks_tenant_idx  ON chunks (tenant_id);
CREATE INDEX chunks_model_idx   ON chunks (model);

-- HNSW index for the cosine-distance operator.
CREATE INDEX chunks_embedding_hnsw
  ON chunks USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- And a GIN trigram index for hybrid lexical search.
CREATE INDEX chunks_content_trgm
  ON chunks USING gin (content gin_trgm_ops);

What this design buys you:

  • tenant_id denormalised onto chunks. RLS can filter by tenant before the HNSW index is consulted. Without this, the planner does the vector search first, then filters, much slower.
  • model column. Switching embedding models is the single most common Day-2 task. Keeping the model name on every row lets you re-embed gradually instead of as a big-bang.
  • ordinal for stable chunk order within a document. Useful for "show me the chunks before and after this match".

HNSW vs IVFFlat, settled

In 2024 there was a real debate. By 2026 it's closed: HNSW for new builds, IVFFlat only for very large corpora (>100M vectors) where HNSW's memory footprint becomes the problem.

The HNSW parameters that matter:

  • m (default 16), number of connections per node. Higher = better recall, more memory, slower build. 16-32 is the right range for most workloads.
  • ef_construction (default 64), quality of the index during build. Higher = better recall, much slower build. 64-200 for quality-sensitive applications.
  • ef_search, query-time parameter (set per session). Higher = better recall, slower query. SET LOCAL hnsw.ef_search = 100 for the queries where you care about recall.

Chunking strategy

Chunking is the place where retrieval quality is won or lost, and almost nobody talks about it because it isn't a model. Three rules:

1. Chunk semantically, not by character count

Fixed-size 512-token chunks were the 2023 default. By 2026, the better approach is structure-aware: split on markdown headings, then on paragraphs, with a soft target around 300-500 tokens. The tools to do this well (LangChain's RecursiveCharacterTextSplitter, LlamaIndex's SentenceSplitter with sentence boundaries) are commoditised.

2. Include enough context to be useful in isolation

A chunk that says "The threshold is 500ms" is useless on its own. Prepend a synthesised header, the document title and the section's heading path, to every chunk before embedding it:

enrich-chunk.tsts
function enrichChunk(doc: Doc, section: Section, body: string): string {
  return [
    doc.title,
    section.headingPath.join(" > "),
    body,
  ].filter(Boolean).join("\n\n");
}

This dramatically improves retrieval quality, especially for technical docs. The cost is a few extra tokens per chunk. Worth it.

3. Overlap matters less than you'd think

20-token overlap between adjacent chunks used to be conventional wisdom. With sentence-aware splitting, we've found 0-token overlap performs almost identically and saves storage. Test on your own corpus before optimising for the overlap value.

Embedding hygiene

The unglamorous operational realities:

  • Never mix models. A cosine-distance comparison between embeddings from two different models is mathematically meaningless. Filter by the model column.
  • Re-embedding is a project, not a button. Plan it as a backfill: shadow-write new embeddings into a second column, run both indexes for a week, swap.
  • Cap input length at the model's recommended limit. Most embeddings models silently truncate past 8K tokens, but truncation usually means the chunk's "tail" is lost. Validate before sending.
  • Use batch embedding endpoints. Embedding latency is dominated by network round-trips, not compute. The provider's batch API is 5-10x faster per 1000 chunks.

Hybrid search: vector + lexical

Pure vector search is great for "find me content about X". It is terrible for queries that include proper nouns, product SKUs, or specific identifiers. "Did the docs mention ef_construction?" needs lexical matching, not similarity.

The pattern that has won in 2026 is reciprocal rank fusion of vector and lexical results. Postgres makes this trivial because the lexical side is just a pg_trgm or full-text search:

hybrid-search.sqlsql
-- Top-K from vector search.
WITH vector_hits AS (
  SELECT id, 1 - (embedding <=> $1) AS sim, row_number() OVER () AS rk
  FROM chunks
  WHERE tenant_id = $2 AND model = $3
  ORDER BY embedding <=> $1
  LIMIT 20
),
-- Top-K from lexical search.
lex_hits AS (
  SELECT id, similarity(content, $4) AS sim, row_number() OVER () AS rk
  FROM chunks
  WHERE tenant_id = $2 AND content % $4
  ORDER BY similarity(content, $4) DESC
  LIMIT 20
),
-- Reciprocal rank fusion (k=60 is the standard constant).
fused AS (
  SELECT id, sum(1.0 / (60 + rk)) AS score
  FROM (
    SELECT id, rk FROM vector_hits
    UNION ALL
    SELECT id, rk FROM lex_hits
  ) all_hits
  GROUP BY id
)
SELECT c.id, c.content, c.document_id, f.score
FROM fused f
JOIN chunks c ON c.id = f.id
ORDER BY f.score DESC
LIMIT 5;

$1 is the query embedding, $2 the tenant id, $3 the embedding model name, $4 the raw query text. In 50 lines of SQL you have a hybrid retrieval pipeline that beats pure vector search on most internal evals.

Performance budgets

For a corpus of 1M-10M chunks with the schema above, on a Supabase Pro / Neon Scale instance:

  • Single vector search, top-5, ef_search = 40: 10-30ms.
  • Single vector search, top-20, ef_search = 100: 30-60ms.
  • Hybrid search (query above): 50-100ms.
  • End-to-end RAG with one LLM call: dominated by the LLM, ~600-1500ms.

If you're past these numbers, the usual culprits are: missing tenant_id index, too-large ef_search, or the embedding column not being indexed. Check EXPLAIN ANALYZE on the query in a SQL playground; the plan should show an HNSW index scan, not a sequential scan.

Failure modes we see in production

1. Re-indexing during peak traffic

Rebuilding an HNSW index over a multi-million-row corpus can take 15 minutes to an hour and consumes IO + CPU. Schedule it during low traffic; use CREATE INDEX CONCURRENTLY to avoid blocking writes.

2. The chunk-table fragmentation problem

High-churn corpora (think: customer chat logs continuously rewritten as conversations evolve) cause heavy bloat on the chunks table. VACUUM FULL is not viable in production; instead, ensure autovacuum is tuned for the table's update rate, and consider partitioning by month.

3. The wrong distance metric

OpenAI embeddings are normalised, so vector_cosine_ops and vector_inner_product_ops give identical orderings. Cohere and Voyage are also normalised. Some open-weight models are not; for those you need to L2-normalise before insert or use vector_l2_ops with a different opclass on the index. Check the model card.

4. Cost surprises from embeddings

Embedding millions of chunks is cheap per row, expensive in aggregate. Track total tokens embedded as a first-class metric. For Suparbase users, our AI chat shows per-conversation token usage; for embeddings, you'll want a custom counter.

Once you have these four shapes covered, RAG on Postgres is one of the calmer parts of your stack. The complicated parts are upstream (chunking, prompts) and downstream (eval); the database is reliably boring, which is what you want.

Suparbase is an admin workspace for Supabase. Encrypted credentials, server-side proxy, RLS debugger, SQL playground, AI assistant with diff-confirmed writes. Free tier for solo projects.

Related articles