Building RAG with Bolt: Complete Implementation Guide
Learn how to implement Retrieval-Augmented Generation (RAG) in your Bolt.new app. Complete guide covering vector databases, embeddings, document processing, and production optimization.

You’ve built an AI-powered app with Bolt. It has a chat interface, connects to an LLM, and can answer questions. But there’s a problem: it only knows what the LLM was trained on. Ask it about your company’s documentation, your product specs, or last week’s meeting notes, and it draws a blank.
This is where Retrieval-Augmented Generation (RAG) comes in. Instead of relying solely on the LLM’s training data, RAG retrieves relevant information from your own documents and uses it to generate more accurate, contextual responses. It’s how you make an AI that actually knows about your specific domain.
This guide covers everything you need to implement RAG in a Bolt app: from understanding the architecture to choosing a vector database, processing documents, and optimizing for production use.
What Is RAG and Why Does It Matter?
RAG combines the generative capabilities of LLMs with external knowledge retrieval. Instead of asking an LLM to answer from memory, you first search for relevant documents, then feed those documents to the LLM along with the question.
The RAG Process in Simple Terms:
1. User asks a question
2. System searches your documents for relevant content
3. Retrieved content is added to the LLM prompt
4. LLM generates an answer using that context
When to Use RAG
Good Use Cases
- Customer support chatbots with knowledge bases
- Internal documentation search
- Legal or compliance document Q&A
- Product information assistants
- Research paper analysis tools
- Personalized learning assistants
Not Ideal For
- General knowledge questions (use the LLM directly)
- Creative writing without reference material
- Real-time data (use API integrations)
- Simple structured data queries (use SQL)
- Tasks requiring a zero-hallucination guarantee
RAG Architecture Overview
A production RAG system has two main phases: ingestion (processing documents into searchable format) and retrieval (finding relevant content at query time).
Document Ingestion Pipeline
1. Load: read PDFs, web pages, markdown files, etc.
2. Chunk: break documents into smaller, manageable pieces (typically 500-1000 tokens)
3. Embed: convert text chunks into numerical vectors using an embedding model
4. Store: save embeddings with metadata for efficient similarity search
Query-Time Retrieval
1. Embed: convert the user question into a vector using the same embedding model
2. Search: find the most similar document chunks by comparing vectors
3. Augment: combine retrieved chunks with the user question
4. Generate: send the augmented prompt to the LLM and return the answer (both phases are sketched below)
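To make the architecture concrete, here is a rough sketch of how the two phases fit together in TypeScript. The helper names are placeholders (declared but not implemented here); the rest of this guide fills them in with real code.
```typescript
// Shape of the two RAG phases; the helpers below are placeholders that later
// sections of this guide implement for real.
type Chunk = { content: string; metadata: Record<string, unknown> };
type Match = { content: string; similarity: number };

declare function chunkText(text: string): Chunk[];
declare function generateEmbeddings(texts: string[]): Promise<number[][]>;
declare function storeChunks(chunks: Chunk[], embeddings: number[][]): Promise<void>;
declare function searchDocuments(query: string, limit: number): Promise<Match[]>;
declare function generateAnswer(question: string, context: string): Promise<string>;

// Ingestion: load -> chunk -> embed -> store
export async function ingestDocument(rawText: string): Promise<void> {
  const chunks = chunkText(rawText);
  const embeddings = await generateEmbeddings(chunks.map(c => c.content));
  await storeChunks(chunks, embeddings);
}

// Retrieval: embed the query -> search -> augment -> generate
export async function answerQuestion(question: string): Promise<string> {
  const matches = await searchDocuments(question, 5);
  const context = matches.map(m => m.content).join('\n\n---\n\n');
  return generateAnswer(question, context);
}
```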
Choosing a Vector Database
The vector database is where your document embeddings live. Your choice affects query speed, cost, and operational complexity.
| Database | Type | Free Tier | Best For |
|---|---|---|---|
| Pinecone | Managed SaaS | 100K vectors | Production, scale, low latency |
| Supabase pgvector | PostgreSQL extension | 500MB database | Already using Supabase, simplicity |
| Weaviate | Managed/Self-hosted | 14-day trial | Hybrid search, multi-modal |
| Chroma | Open source/Cloud | Self-hosted free | Local dev, prototyping |
| Qdrant | Managed/Self-hosted | 1M vectors cloud | High performance, filtering |
Recommendation: If you’re already using Supabase for auth and database, pgvector is the simplest path. For larger scale or lower latency requirements, Pinecone is the industry standard.
Implementation with Supabase pgvector
Let’s build a complete RAG system using Supabase for both the vector database and storage. This keeps your infrastructure simple.
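The API routes below import a createClient helper from '@/lib/supabase/server'. Bolt's Supabase integration usually scaffolds this for you; if yours doesn't, a minimal server-side version might look like the following (an assumption, using @supabase/supabase-js directly rather than the cookie-aware @supabase/ssr client):
```typescript
// lib/supabase/server.ts (minimal assumed helper)
import { createClient as createSupabaseClient } from '@supabase/supabase-js';

export async function createClient() {
  // Server-only credentials; never ship the service role key to the browser
  return createSupabaseClient(
    process.env.SUPABASE_URL!,
    process.env.SUPABASE_SERVICE_ROLE_KEY!
  );
}
```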
Step 1: Enable the Vector Extension
```sql
-- Enable the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create table for document chunks
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  content TEXT NOT NULL,
  metadata JSONB DEFAULT '{}',
  embedding vector(1536), -- OpenAI text-embedding-3-small dimension
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create index for similarity search
CREATE INDEX ON documents
  USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);
```
Step 2: Create the Embedding Service
```typescript
// lib/embeddings.ts
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// Embed a single piece of text (used for queries)
export async function generateEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return response.data[0].embedding;
}

// Embed a batch of texts in one API call (used for document chunks)
export async function generateEmbeddings(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: texts,
  });
  return response.data.map(item => item.embedding);
}
```
Step 3: Document Chunking
Chunking strategy significantly impacts retrieval quality. Here’s a basic approach:
```typescript
// lib/chunking.ts
interface Chunk {
  content: string;
  metadata: Record<string, unknown>;
}

export function chunkText(
  text: string,
  chunkSize: number = 1000,
  overlap: number = 200
): Chunk[] {
  const chunks: Chunk[] = [];
  let start = 0;

  while (start < text.length) {
    // Find a good break point (end of sentence)
    let end = start + chunkSize;
    if (end < text.length) {
      // Look for sentence boundary
      const lastPeriod = text.lastIndexOf('.', end);
      const lastNewline = text.lastIndexOf('\n', end);
      const breakPoint = Math.max(lastPeriod, lastNewline);
      if (breakPoint > start + chunkSize / 2) {
        end = breakPoint + 1;
      }
    }

    const content = text.slice(start, end).trim();
    if (content.length > 0) {
      chunks.push({
        content,
        metadata: {
          startIndex: start,
          endIndex: end,
        },
      });
    }

    start = end - overlap;
  }

  return chunks;
}
```
Step 4: Document Ingestion API
```typescript
// app/api/ingest/route.ts
import { NextResponse } from 'next/server';
import { createClient } from '@/lib/supabase/server';
import { generateEmbeddings } from '@/lib/embeddings';
import { chunkText } from '@/lib/chunking';

export async function POST(request: Request) {
  try {
    const { content, metadata } = await request.json();

    // Chunk the document
    const chunks = chunkText(content);

    // Generate embeddings for all chunks
    const embeddings = await generateEmbeddings(
      chunks.map(c => c.content)
    );

    // Store in Supabase
    const supabase = await createClient();
    const documents = chunks.map((chunk, i) => ({
      content: chunk.content,
      metadata: { ...metadata, ...chunk.metadata },
      embedding: embeddings[i],
    }));

    const { error } = await supabase
      .from('documents')
      .insert(documents);

    if (error) throw error;

    return NextResponse.json({
      success: true,
      chunksProcessed: chunks.length,
    });
  } catch (error) {
    console.error('Ingestion failed:', error);
    return NextResponse.json(
      { error: 'Ingestion failed' },
      { status: 500 }
    );
  }
}
```
Step 5: Similarity Search
```typescript
// lib/retrieval.ts
import { createClient } from '@/lib/supabase/server';
import { generateEmbedding } from '@/lib/embeddings';

export interface SearchResult {
  content: string;
  metadata: Record<string, unknown>;
  similarity: number;
}

export async function searchDocuments(
  query: string,
  limit: number = 5
): Promise<SearchResult[]> {
  const supabase = await createClient();

  // Generate embedding for the query
  const queryEmbedding = await generateEmbedding(query);

  // Search for similar documents
  const { data, error } = await supabase.rpc('match_documents', {
    query_embedding: queryEmbedding,
    match_threshold: 0.7,
    match_count: limit,
  });

  if (error) throw error;
  return data;
}
```
Create the matching function in Supabase:
```sql
-- Supabase SQL function for similarity search
CREATE OR REPLACE FUNCTION match_documents(
  query_embedding vector(1536),
  match_threshold float,
  match_count int
)
RETURNS TABLE (
  content text,
  metadata jsonb,
  similarity float
)
LANGUAGE plpgsql
AS $$
BEGIN
  RETURN QUERY
  SELECT
    documents.content,
    documents.metadata,
    1 - (documents.embedding <=> query_embedding) as similarity
  FROM documents
  WHERE 1 - (documents.embedding <=> query_embedding) > match_threshold
  ORDER BY documents.embedding <=> query_embedding
  LIMIT match_count;
END;
$$;
```
Step 6: RAG Chat Endpoint
```typescript
// app/api/chat/route.ts
import { NextResponse } from 'next/server';
import OpenAI from 'openai';
import { searchDocuments } from '@/lib/retrieval';

const openai = new OpenAI();

export async function POST(request: Request) {
  try {
    const { message } = await request.json();

    // Retrieve relevant documents
    const relevantDocs = await searchDocuments(message, 5);

    // Build context from retrieved documents
    const context = relevantDocs
      .map(doc => doc.content)
      .join('\n\n---\n\n');

    // Generate response with context
    const response = await openai.chat.completions.create({
      model: 'gpt-4-turbo-preview',
      messages: [
        {
          role: 'system',
          content: `You are a helpful assistant. Answer questions based on the provided context. If the context doesn't contain relevant information, say so.
Context:
${context}`,
        },
        {
          role: 'user',
          content: message,
        },
      ],
      temperature: 0.7,
      max_tokens: 1000,
    });

    return NextResponse.json({
      answer: response.choices[0].message.content,
      sources: relevantDocs.map(doc => ({
        content: doc.content.slice(0, 200) + '...',
        similarity: doc.similarity,
      })),
    });
  } catch (error) {
    console.error('Chat failed:', error);
    return NextResponse.json(
      { error: 'Failed to generate response' },
      { status: 500 }
    );
  }
}
```
Processing Different Document Types
Real applications need to handle various file formats. Here’s how to process common types:
PDF Processing
```typescript
// lib/loaders/pdf.ts
import pdf from 'pdf-parse';

export async function loadPDF(buffer: Buffer): Promise<string> {
  const data = await pdf(buffer);
  return data.text;
}

// Usage in API route
import { loadPDF } from '@/lib/loaders/pdf';

export async function POST(request: Request) {
  const formData = await request.formData();
  const file = formData.get('file') as File;
  const buffer = Buffer.from(await file.arrayBuffer());
  const text = await loadPDF(buffer);
  // Then chunk and embed as before
}
```
Web Page Scraping
```typescript
// lib/loaders/web.ts
import * as cheerio from 'cheerio';

export async function loadWebPage(url: string): Promise<string> {
  const response = await fetch(url);
  const html = await response.text();
  const $ = cheerio.load(html);

  // Remove script and style elements
  $('script, style, nav, footer, header').remove();

  // Get main content
  const content = $('main, article, .content').text()
    || $('body').text();

  // Clean up whitespace
  return content.replace(/\s+/g, ' ').trim();
}
```
Improving RAG Quality
Basic RAG works, but production systems need optimizations to handle edge cases and improve accuracy.
Hybrid Search
Combine semantic search with keyword search for better results:
- Semantic search finds conceptually similar content (“car” matches “automobile”)
- Keyword search ensures exact matches aren’t missed (“API-KEY-123” finds that exact string)
- Hybrid combines both, typically weighted 0.7 semantic / 0.3 keyword (see the sketch below)
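As an illustration of that weighting, here is a minimal client-side sketch that blends the vector similarity returned by searchDocuments with a naive keyword-overlap score (it relies on the SearchResult type exported from lib/retrieval above). Production systems usually push the keyword half into the database, for example Postgres full-text search, and merge results there; treat this as the idea, not the implementation.
```typescript
import { searchDocuments, SearchResult } from '@/lib/retrieval';

// Naive keyword score: fraction of query terms that appear in the chunk
function keywordScore(query: string, content: string): number {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return 0;
  const haystack = content.toLowerCase();
  return terms.filter(term => haystack.includes(term)).length / terms.length;
}

export async function hybridSearch(query: string, limit = 5): Promise<SearchResult[]> {
  // Over-fetch semantically, then re-score with the 0.7 / 0.3 blend
  const candidates = await searchDocuments(query, limit * 3);

  return candidates
    .map(result => ({
      ...result,
      similarity: 0.7 * result.similarity + 0.3 * keywordScore(query, result.content),
    }))
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, limit);
}
```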
Better Chunking Strategies
- Semantic Chunking: split at natural boundaries (paragraphs, sections) rather than fixed character counts (a sketch follows this list).
- Parent-Child Chunking: retrieve smaller chunks for precision, but include parent context for completeness.
- Overlapping Windows: chunks overlap by 10-20% to avoid splitting important context at boundaries.
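For example, a simple paragraph-aware chunker splits on blank lines and packs whole paragraphs up to a size budget instead of cutting at a fixed offset. This is a rough sketch reusing the same Chunk shape as lib/chunking.ts:
```typescript
// Same shape as the Chunk interface in lib/chunking.ts
interface Chunk {
  content: string;
  metadata: Record<string, unknown>;
}

// Split on paragraph boundaries, then pack paragraphs into chunks up to maxChars
export function chunkByParagraph(text: string, maxChars: number = 1500): Chunk[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map(p => p.trim())
    .filter(p => p.length > 0);

  const chunks: Chunk[] = [];
  let current = '';

  for (const paragraph of paragraphs) {
    if (current && current.length + paragraph.length + 2 > maxChars) {
      chunks.push({ content: current, metadata: { strategy: 'paragraph' } });
      current = paragraph;
    } else {
      current = current ? `${current}\n\n${paragraph}` : paragraph;
    }
  }

  if (current) {
    chunks.push({ content: current, metadata: { strategy: 'paragraph' } });
  }

  return chunks;
}
```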
Re-ranking Retrieved Results
Initial retrieval is fast but imperfect. A re-ranking step can improve accuracy:
```typescript
// Re-rank with an LLM for better accuracy (a dedicated cross-encoder model is another option)
import Anthropic from '@anthropic-ai/sdk';
import { SearchResult } from '@/lib/retrieval';

async function rerankResults(
  query: string,
  results: SearchResult[]
): Promise<SearchResult[]> {
  const anthropic = new Anthropic();

  const prompt = `Given the query and documents below, rank the documents from most to least relevant.
Query: ${query}
Documents:
${results.map((r, i) => `[${i}] ${r.content.slice(0, 500)}`).join('\n\n')}
Return only the document indices in order of relevance, comma-separated.`;

  const response = await anthropic.messages.create({
    model: 'claude-3-haiku-20240307',
    max_tokens: 100,
    messages: [{ role: 'user', content: prompt }],
  });

  // Parse and reorder results (the response may contain non-text blocks, so guard the type)
  const block = response.content[0];
  const text = block.type === 'text' ? block.text : '';
  const order = text
    .split(',')
    .map(s => parseInt(s.trim(), 10))
    .filter(i => !Number.isNaN(i));

  return order.map(i => results[i]).filter(Boolean);
}
```
Production Considerations
Cost Optimization
- Cache Embeddings: don’t re-embed the same query twice. Cache frequently asked questions and their embeddings (see the sketch below).
- Use Smaller Models First: route simple queries to cheaper models. Only use GPT-4 when necessary.
- Batch Ingestion: process multiple documents together to reduce API calls.
- Choose the Right Embedding Model: text-embedding-3-small is 5x cheaper than text-embedding-3-large with good quality.
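For the caching point, a minimal in-memory cache wrapped around generateEmbedding is enough to stop repeat queries from being re-embedded. This is a sketch; in production you would more likely cache in Redis or a database table keyed by a hash of the text.
```typescript
import { generateEmbedding } from '@/lib/embeddings';

// Simple in-memory cache keyed by the exact query text
const embeddingCache = new Map<string, number[]>();

export async function getCachedEmbedding(text: string): Promise<number[]> {
  const cached = embeddingCache.get(text);
  if (cached) return cached;

  const embedding = await generateEmbedding(text);
  embeddingCache.set(text, embedding);
  return embedding;
}
```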
Rate Limiting and Error Handling
```typescript
// lib/rateLimiter.ts
const requestCounts = new Map<string, number[]>();

export function isRateLimited(userId: string, limit: number = 10): boolean {
  const now = Date.now();
  const windowMs = 60000; // 1 minute window
  const requests = requestCounts.get(userId) || [];
  const recentRequests = requests.filter(time => now - time < windowMs);

  if (recentRequests.length >= limit) {
    return true;
  }

  recentRequests.push(now);
  requestCounts.set(userId, recentRequests);
  return false;
}
```
Monitoring and Observability
Key Metrics to Track:
- Query latency across embedding, retrieval, and generation (a simple timing helper is sketched below)
- Retrieval quality (are relevant docs being found?)
- Cost per query (embedding + LLM tokens)
- Error rates by query type
- User feedback on answer quality
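A lightweight starting point is to time each phase per request and log the numbers (a sketch; in practice you would forward these to whatever logging or analytics service you already use):
```typescript
// Time an async phase and return its result plus elapsed milliseconds
export async function timed<T>(fn: () => Promise<T>): Promise<[T, number]> {
  const start = Date.now();
  const result = await fn();
  return [result, Date.now() - start];
}

// Usage inside the chat route (hypothetical wiring):
// const [docs, retrievalMs] = await timed(() => searchDocuments(message, 5));
// const [completion, generationMs] = await timed(() => openai.chat.completions.create(params));
// console.log({ retrievalMs, generationMs, totalMs: retrievalMs + generationMs });
```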
Common Pitfalls
Pitfall: Chunks Too Large or Too Small
Large chunks waste context window. Small chunks lose meaning.
Solution: Start with 500-1000 tokens per chunk. Test and adjust based on your content.
Pitfall: Ignoring Metadata
Without source tracking, users can’t verify answers.
Solution: Always store and return source documents, page numbers, and timestamps.
Pitfall: Not Handling “I Don’t Know”
LLMs will hallucinate answers if the context doesn’t contain relevant information.
Solution: Check similarity scores. If the best match scores too low, tell the user you don’t have that information (a minimal check is sketched below).
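In the chat route from Step 6, that check can be as simple as inspecting the top similarity score before calling the LLM. The 0.75 cutoff below is an arbitrary example to tune against your own data:
```typescript
const relevantDocs = await searchDocuments(message, 5);

// If nothing scores above the cutoff, answer honestly instead of letting the LLM guess
const MIN_SIMILARITY = 0.75;
if (relevantDocs.length === 0 || relevantDocs[0].similarity < MIN_SIMILARITY) {
  return NextResponse.json({
    answer: "I don't have information about that in my knowledge base.",
    sources: [],
  });
}
```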
Pitfall: Stale Data
Documents change. Your embeddings need to stay current.
Solution: Implement document versioning and scheduled re-ingestion (a simple content-hash check is sketched below).
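One lightweight approach is to hash each source document and skip re-ingestion when the hash hasn’t changed. This sketch assumes you keep a content hash of your own (for example a hypothetical content_hash column) alongside each ingested document:
```typescript
import { createHash } from 'crypto';

// Hash the raw document so unchanged content can be skipped during scheduled re-ingestion
export function contentHash(content: string): string {
  return createHash('sha256').update(content).digest('hex');
}

// storedHash comes from your own bookkeeping, e.g. a content_hash column
export function needsReingestion(content: string, storedHash: string | null): boolean {
  return storedHash === null || storedHash !== contentHash(content);
}
```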
Summary
RAG transforms a generic AI chatbot into a knowledgeable assistant that understands your specific domain. The core concepts are straightforward:
- Ingest documents by chunking and embedding them
- Store embeddings in a vector database for fast similarity search
- Retrieve context at query time based on semantic similarity
- Generate answers using the LLM with retrieved context
- Optimize with better chunking, hybrid search, and re-ranking
Implementation details matter. Chunking strategy, embedding model choice, and prompt engineering all affect quality. Start simple, measure results, and iterate.
Need Help Building Your RAG System?
RAG implementation involves many moving parts. If you want expert guidance on building a production-ready RAG system for your Bolt app, we can help.