Wang.se - AI Agency

Chunk overlap: Use 10-20% overlap between chunks to avoid losing context at boundaries
Hybrid search: Combine dense vector search with sparse (BM25) retrieval for better recall
Reranking: Apply a cross-encoder reranker to the top-k results before passing them to the LLM
Metadata filtering: Leverage document metadata to narrow retrieval scope
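As a rough illustration of the hybrid search point, the sketch below merges a dense ranking and a sparse (BM25) ranking with reciprocal rank fusion. The text does not say how the two result lists should be combined, so RRF is an assumption here, as are the function name, the constant k=60, and the sample document IDs.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in;
    k=60 is a commonly used default, not a value from the article.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a vector search and a BM25 search.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, which is the recall benefit hybrid search is after. A cross-encoder reranker could then rescore the head of this fused list.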

Document Ingestion — Source documents are chunked into manageable pieces (typically 256-1024 tokens) and converted into vector embeddings.
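A minimal sketch of the chunking step with overlap might look like the following. It treats list items as tokens, so whitespace-split words stand in for real tokenizer output; actual token counts depend on the embedding model's tokenizer. The function name and defaults are illustrative assumptions.

```python
def chunk_tokens(tokens, chunk_size=512, overlap_ratio=0.15):
    """Split a token list into fixed-size chunks that share an overlap.

    overlap_ratio is the fraction of each chunk repeated at the start
    of the next one (0.10 to 0.20 matches the 10-20% guideline).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # always >= 1 because overlap < chunk_size

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Whitespace split is only a stand-in for a real tokenizer.
text = "retrieval augmented generation grounds model answers in retrieved source documents"
for chunk in chunk_tokens(text.split(), chunk_size=4, overlap_ratio=0.25):
    print(" ".join(chunk))
```

Each chunk would then be passed to the embedding model, which is outside the scope of this sketch.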

Vector Storage — Embeddings are stored in a vector database (Pinecone, Weaviate, Qdrant, or pgvector) with metadata for filtering.

Query Processing — When a user query arrives, it is embedded with the same embedding model used at ingestion, and the database runs a similarity search to return the top-k most relevant chunks.
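The storage and query steps can be sketched with a small in-memory stand-in for a vector database. This is not the API of Pinecone, Weaviate, Qdrant, or pgvector; the Record class, the top_k function, and the hand-written two-dimensional vectors are all assumptions for illustration. In a real pipeline the vectors would come from the embedding model.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Record:
    text: str
    vector: list
    metadata: dict = field(default_factory=dict)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, records, k=3, where=None):
    """Return the k records most similar to the query vector.

    `where` is an optional exact-match metadata filter applied
    before scoring, mirroring metadata filtering in a vector database.
    """
    where = where or {}
    candidates = [
        r for r in records
        if all(r.metadata.get(key) == value for key, value in where.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query_vector, r.vector),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors; real ones would be produced by the embedding model.
records = [
    Record("Refund policy", [1.0, 0.0], {"source": "faq"}),
    Record("Shipping times", [0.0, 1.0], {"source": "faq"}),
    Record("Refund API", [0.9, 0.1], {"source": "docs"}),
]
query_vector = [1.0, 0.1]

print([r.text for r in top_k(query_vector, records, k=2)])
print([r.text for r in top_k(query_vector, records, k=1, where={"source": "faq"})])
```

A production vector database replaces the linear scan with an approximate nearest-neighbor index, but the contract is the same: a query vector and an optional filter go in, ranked chunks come out.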

Context Augmentation — The retrieved chunks are injected into the LLM prompt as additional context, grounding the response in factual data.
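One possible way to inject retrieved chunks into a prompt is shown below. The prompt wording, the numbered-context layout, and the build_prompt name are assumptions, not a format prescribed by the article or by any particular LLM provider.

```python
def build_prompt(query, chunks):
    """Assemble a prompt that places retrieved chunks ahead of the question."""
    # Number each chunk so the model (and a reader) can refer back to it.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Hypothetical retrieved chunks and user query.
retrieved = [
    "Refunds are issued within 14 days of a return.",
    "Returns require the original receipt.",
]
prompt = build_prompt("How long do refunds take?", retrieved)
print(prompt)
```

The resulting string is what would be sent to the LLM in the generation step. Instructing the model to admit when the context is insufficient is a common design choice for reducing unsupported answers.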

Generation — The LLM generates a response conditioned on both the query and the retrieved context.


How RAG Pipelines Work

Best Practices