Building RAG with Bolt: Complete Implementation Guide

Learn how to implement Retrieval-Augmented Generation (RAG) in your Bolt.new app. Complete guide covering vector databases, embeddings, document processing, and production optimization.

ShipAi Team
18 min read

You’ve built an AI-powered app with Bolt. It has a chat interface, connects to an LLM, and can answer questions. But there’s a problem: it only knows what the LLM was trained on. Ask it about your company’s documentation, your product specs, or last week’s meeting notes, and it draws a blank.

This is where Retrieval-Augmented Generation (RAG) comes in. Instead of relying solely on the LLM’s training data, RAG retrieves relevant information from your own documents and uses it to generate more accurate, contextual responses. It’s how you make an AI that actually knows about your specific domain.

This guide covers everything you need to implement RAG in a Bolt app: from understanding the architecture to choosing a vector database, processing documents, and optimizing for production use.

What Is RAG and Why Does It Matter?

RAG combines the generative capabilities of LLMs with external knowledge retrieval. Instead of asking an LLM to answer from memory, you first search for relevant documents, then feed those documents to the LLM along with the question.

The RAG Process in Simple Terms:

  1. User asks a question
  2. System searches your documents for relevant content
  3. Retrieved content is added to the LLM prompt
  4. LLM generates an answer using that context
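
In code, that whole loop fits in a few lines. Here is a minimal TypeScript sketch of the four steps, with the search and generation steps passed in as placeholder functions (concrete versions of both are built later in this guide):

// A minimal sketch of the RAG loop. SearchFn and GenerateFn are
// placeholders for the retrieval and LLM calls implemented below.
type SearchFn = (query: string, limit: number) => Promise<string[]>;
type GenerateFn = (prompt: string) => Promise<string>;

export async function answerWithRag(
  question: string,
  search: SearchFn,
  generate: GenerateFn
): Promise<string> {
  // 2. Search your documents for relevant content
  const chunks = await search(question, 5);

  // 3. Add the retrieved content to the LLM prompt
  const prompt = `Answer using only this context:\n\n${chunks.join('\n\n')}\n\nQuestion: ${question}`;

  // 4. Generate an answer using that context
  return generate(prompt);
}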

When to Use RAG

Good Use Cases

  • Customer support chatbots with knowledge bases
  • Internal documentation search
  • Legal or compliance document Q&A
  • Product information assistants
  • Research paper analysis tools
  • Personalized learning assistants

Not Ideal For

  • General knowledge questions (use LLM directly)
  • Creative writing without reference material
  • Real-time data (use API integrations)
  • Simple structured data queries (use SQL)
  • Tasks requiring zero hallucination guarantee

RAG Architecture Overview

A production RAG system has two main phases: ingestion (processing documents into searchable format) and retrieval (finding relevant content at query time).

Document Ingestion Pipeline

  1. Load Documents: Read PDFs, web pages, markdown files, etc.
  2. Split into Chunks: Break documents into smaller, manageable pieces (typically 500-1000 tokens)
  3. Generate Embeddings: Convert text chunks into numerical vectors using an embedding model
  4. Store in Vector Database: Save embeddings with metadata for efficient similarity search

Query-Time Retrieval

  1. Embed the Query: Convert the user question into a vector using the same embedding model
  2. Vector Similarity Search: Find the most similar document chunks by comparing vectors
  3. Build Context Prompt: Combine retrieved chunks with the user question
  4. Generate Response: Send to the LLM and return the answer

Choosing a Vector Database

The vector database is where your document embeddings live. Your choice affects query speed, cost, and operational complexity.

  Database | Type | Free Tier | Best For
  Pinecone | Managed SaaS | 100K vectors | Production, scale, low latency
  Supabase pgvector | PostgreSQL extension | 500MB database | Already using Supabase, simplicity
  Weaviate | Managed/Self-hosted | 14-day trial | Hybrid search, multi-modal
  Chroma | Open source/Cloud | Self-hosted free | Local dev, prototyping
  Qdrant | Managed/Self-hosted | 1M vectors cloud | High performance, filtering

Recommendation: If you’re already using Supabase for auth and database, pgvector is the simplest path. For larger scale or lower latency requirements, Pinecone is the industry standard.

Implementation with Supabase pgvector

Let’s build a complete RAG system using Supabase for both the vector database and storage. This keeps your infrastructure simple.

Step 1: Enable the Vector Extension

-- Enable the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create table for document chunks
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  content TEXT NOT NULL,
  metadata JSONB DEFAULT '{}',
  embedding vector(1536),  -- OpenAI text-embedding-3-small dimension
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create index for similarity search
CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

Step 2: Create the Embedding Service

// lib/embeddings.ts
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

export async function generateEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });

  return response.data[0].embedding;
}

export async function generateEmbeddings(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: texts,
  });

  return response.data.map(item => item.embedding);
}

Step 3: Document Chunking

Chunking strategy significantly impacts retrieval quality. Here’s a basic approach:

// lib/chunking.ts

interface Chunk {
  content: string;
  metadata: Record<string, unknown>;
}

export function chunkText(
  text: string,
  chunkSize: number = 1000,
  overlap: number = 200
): Chunk[] {
  const chunks: Chunk[] = [];
  let start = 0;

  while (start < text.length) {
    // Find a good break point (end of sentence)
    let end = start + chunkSize;

    if (end < text.length) {
      // Look for sentence boundary
      const lastPeriod = text.lastIndexOf('.', end);
      const lastNewline = text.lastIndexOf('\n', end);
      const breakPoint = Math.max(lastPeriod, lastNewline);

      if (breakPoint > start + chunkSize / 2) {
        end = breakPoint + 1;
      }
    }

    const content = text.slice(start, end).trim();

    if (content.length > 0) {
      chunks.push({
        content,
        metadata: {
          startIndex: start,
          endIndex: end,
        },
      });
    }

    // Stop once the whole text is consumed to avoid a duplicate tail chunk
    if (end >= text.length) break;

    start = end - overlap;
  }

  return chunks;
}

Step 4: Document Ingestion API

// app/api/ingest/route.ts
import { NextResponse } from 'next/server';
import { createClient } from '@/lib/supabase/server';
import { generateEmbeddings } from '@/lib/embeddings';
import { chunkText } from '@/lib/chunking';

export async function POST(request: Request) {
  try {
    const { content, metadata } = await request.json();

    // Chunk the document
    const chunks = chunkText(content);

    // Generate embeddings for all chunks
    const embeddings = await generateEmbeddings(
      chunks.map(c => c.content)
    );

    // Store in Supabase
    const supabase = await createClient();

    const documents = chunks.map((chunk, i) => ({
      content: chunk.content,
      metadata: { ...metadata, ...chunk.metadata },
      embedding: embeddings[i],
    }));

    const { error } = await supabase
      .from('documents')
      .insert(documents);

    if (error) throw error;

    return NextResponse.json({
      success: true,
      chunksProcessed: chunks.length,
    });
  } catch (error) {
    console.error('Ingestion failed:', error);
    return NextResponse.json(
      { error: 'Ingestion failed' },
      { status: 500 }
    );
  }
}

Step 5: Similarity Search

// lib/retrieval.ts
import { createClient } from '@/lib/supabase/server';
import { generateEmbedding } from '@/lib/embeddings';

interface SearchResult {
  content: string;
  metadata: Record<string, unknown>;
  similarity: number;
}

export async function searchDocuments(
  query: string,
  limit: number = 5
): Promise<SearchResult[]> {
  const supabase = await createClient();

  // Generate embedding for the query
  const queryEmbedding = await generateEmbedding(query);

  // Search for similar documents
  const { data, error } = await supabase.rpc('match_documents', {
    query_embedding: queryEmbedding,
    match_threshold: 0.7,
    match_count: limit,
  });

  if (error) throw error;

  return data;
}

Create the matching function in Supabase:

-- Supabase SQL function for similarity search
CREATE OR REPLACE FUNCTION match_documents(
  query_embedding vector(1536),
  match_threshold float,
  match_count int
)
RETURNS TABLE (
  content text,
  metadata jsonb,
  similarity float
)
LANGUAGE plpgsql
AS $$
BEGIN
  RETURN QUERY
  SELECT
    documents.content,
    documents.metadata,
    1 - (documents.embedding <=> query_embedding) as similarity
  FROM documents
  WHERE 1 - (documents.embedding <=> query_embedding) > match_threshold
  ORDER BY documents.embedding <=> query_embedding
  LIMIT match_count;
END;
$$;

Step 6: RAG Chat Endpoint

// app/api/chat/route.ts
import { NextResponse } from 'next/server';
import OpenAI from 'openai';
import { searchDocuments } from '@/lib/retrieval';

const openai = new OpenAI();

export async function POST(request: Request) {
  try {
    const { message } = await request.json();

    // Retrieve relevant documents
    const relevantDocs = await searchDocuments(message, 5);

    // Build context from retrieved documents
    const context = relevantDocs
      .map(doc => doc.content)
      .join('\n\n---\n\n');

    // Generate response with context
    const response = await openai.chat.completions.create({
      model: 'gpt-4-turbo-preview',
      messages: [
        {
          role: 'system',
          content: `You are a helpful assistant. Answer questions based on the provided context. If the context doesn't contain relevant information, say so.

Context:
${context}`,
        },
        {
          role: 'user',
          content: message,
        },
      ],
      temperature: 0.7,
      max_tokens: 1000,
    });

    return NextResponse.json({
      answer: response.choices[0].message.content,
      sources: relevantDocs.map(doc => ({
        content: doc.content.slice(0, 200) + '...',
        similarity: doc.similarity,
      })),
    });
  } catch (error) {
    console.error('Chat failed:', error);
    return NextResponse.json(
      { error: 'Failed to generate response' },
      { status: 500 }
    );
  }
}

Processing Different Document Types

Real applications need to handle various file formats. Here’s how to process common types:

PDF Processing

// lib/loaders/pdf.ts
import pdf from 'pdf-parse';

export async function loadPDF(buffer: Buffer): Promise<string> {
  const data = await pdf(buffer);
  return data.text;
}

// Usage in API route
import { loadPDF } from '@/lib/loaders/pdf';

export async function POST(request: Request) {
  const formData = await request.formData();
  const file = formData.get('file') as File;

  const buffer = Buffer.from(await file.arrayBuffer());
  const text = await loadPDF(buffer);

  // Then chunk and embed as before
}

Web Page Scraping

// lib/loaders/web.ts
import * as cheerio from 'cheerio';

export async function loadWebPage(url: string): Promise<string> {
  const response = await fetch(url);
  const html = await response.text();

  const $ = cheerio.load(html);

  // Remove script and style elements
  $('script, style, nav, footer, header').remove();

  // Get main content
  const content = $('main, article, .content').text()
    || $('body').text();

  // Clean up whitespace
  return content.replace(/\s+/g, ' ').trim();
}

Improving RAG Quality

Basic RAG works, but production systems need optimizations to handle edge cases and improve accuracy.

Hybrid Search

Combine semantic search with keyword search for better results:

  • Semantic search finds conceptually similar content (“car” matches “automobile”)
  • Keyword search ensures exact matches aren’t missed (“API-KEY-123” finds that exact string)
  • Hybrid combines both, typically weighted 0.7 semantic / 0.3 keyword
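
The fusion step itself is simple arithmetic. Below is a minimal sketch of weighted score fusion in TypeScript, assuming you already have a semantic result list (from the vector search) and a keyword result list (for example, from Postgres full-text search), each carrying a score normalized to the 0-1 range:

// lib/hybrid.ts
// Weighted fusion of semantic and keyword search results (sketch).
// Assumes both result lists carry scores normalized to 0-1.

interface ScoredResult {
  id: string;
  content: string;
  score: number;
}

export function fuseResults(
  semantic: ScoredResult[],
  keyword: ScoredResult[],
  semanticWeight = 0.7
): ScoredResult[] {
  const combined = new Map<string, ScoredResult>();

  for (const r of semantic) {
    combined.set(r.id, { ...r, score: r.score * semanticWeight });
  }

  for (const r of keyword) {
    const keywordScore = r.score * (1 - semanticWeight);
    const existing = combined.get(r.id);
    if (existing) {
      existing.score += keywordScore;
    } else {
      combined.set(r.id, { ...r, score: keywordScore });
    }
  }

  // Highest combined score first
  return [...combined.values()].sort((a, b) => b.score - a.score);
}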

Better Chunking Strategies

Semantic Chunking

Split at natural boundaries (paragraphs, sections) rather than fixed character counts.

Parent-Child Chunking

Retrieve smaller chunks for precision, but include parent context for completeness.

Overlapping Windows

Chunks overlap by 10-20% to avoid splitting important context at boundaries.
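
As a concrete example of the first strategy, here is a minimal sketch of semantic chunking that splits on blank lines (paragraph boundaries) and falls back to the fixed-size chunkText helper from Step 3 only when a single paragraph exceeds the size limit:

// lib/semanticChunking.ts
// Paragraph-boundary chunking with a fixed-size fallback (sketch).
import { chunkText } from './chunking';

interface Chunk {
  content: string;
  metadata: Record<string, unknown>;
}

export function chunkByParagraph(text: string, maxChars = 1000): Chunk[] {
  const paragraphs = text.split(/\n\s*\n/);
  const chunks: Chunk[] = [];
  let current = '';

  for (const paragraph of paragraphs) {
    const trimmed = paragraph.trim();
    if (!trimmed) continue;

    // Oversized paragraph: flush what we have, then fixed-size split it
    if (trimmed.length > maxChars) {
      if (current) {
        chunks.push({ content: current, metadata: {} });
        current = '';
      }
      chunks.push(...chunkText(trimmed, maxChars));
      continue;
    }

    // Start a new chunk when adding this paragraph would exceed the limit
    if (current && current.length + trimmed.length + 2 > maxChars) {
      chunks.push({ content: current, metadata: {} });
      current = trimmed;
    } else {
      current = current ? `${current}\n\n${trimmed}` : trimmed;
    }
  }

  if (current) {
    chunks.push({ content: current, metadata: {} });
  }

  return chunks;
}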

Re-ranking Retrieved Results

Initial retrieval is fast but imperfect. A re-ranking step can improve accuracy:

// Re-rank with a cross-encoder for better accuracy
import Anthropic from '@anthropic-ai/sdk';

async function rerankResults(
  query: string,
  results: SearchResult[]
): Promise<SearchResult[]> {
  const anthropic = new Anthropic();

  const prompt = `Given the query and documents below, rank the documents from most to least relevant.

Query: ${query}

Documents:
${results.map((r, i) => `[${i}] ${r.content.slice(0, 500)}`).join('\n\n')}

Return only the document indices in order of relevance, comma-separated.`;

  const response = await anthropic.messages.create({
    model: 'claude-3-haiku-20240307',
    max_tokens: 100,
    messages: [{ role: 'user', content: prompt }],
  });

  // Parse and reorder results (content[0] may be a non-text block, so narrow first)
  const block = response.content[0];
  const text = block.type === 'text' ? block.text : '';

  const order = text
    .split(',')
    .map(s => parseInt(s.trim(), 10))
    .filter(i => !Number.isNaN(i));

  return order.map(i => results[i]).filter(Boolean);
}

Production Considerations

Cost Optimization

Cache Embeddings

Don’t re-embed the same query twice. Cache frequently asked questions and their embeddings.
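
A minimal sketch of such a cache, wrapping the generateEmbedding helper from Step 2 (an in-memory Map works for a single server instance; use Redis or a database table if you run multiple instances):

// lib/embeddingCache.ts
// In-memory cache in front of generateEmbedding (single-instance sketch).
import { generateEmbedding } from './embeddings';

const cache = new Map<string, number[]>();
const MAX_ENTRIES = 1000;

export async function cachedEmbedding(text: string): Promise<number[]> {
  const key = text.trim().toLowerCase();

  const hit = cache.get(key);
  if (hit) return hit;

  const embedding = await generateEmbedding(text);

  // Very simple eviction: drop the oldest entry once the cache is full
  if (cache.size >= MAX_ENTRIES) {
    const oldestKey = cache.keys().next().value;
    if (oldestKey !== undefined) cache.delete(oldestKey);
  }

  cache.set(key, embedding);
  return embedding;
}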

Use Smaller Models First

Route simple queries to cheaper models. Only use GPT-4 when necessary.

Batch Ingestion

Process multiple documents together to reduce API calls.

Choose the Right Embedding Model

text-embedding-3-small is 5x cheaper than text-embedding-3-large with good quality.

Rate Limiting and Error Handling

// lib/rateLimiter.ts
const requestCounts = new Map<string, number[]>();

export function isRateLimited(userId: string, limit: number = 10): boolean {
  const now = Date.now();
  const windowMs = 60000; // 1 minute window

  const requests = requestCounts.get(userId) || [];
  const recentRequests = requests.filter(time => now - time < windowMs);

  if (recentRequests.length >= limit) {
    return true;
  }

  recentRequests.push(now);
  requestCounts.set(userId, recentRequests);

  return false;
}
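
Wiring the limiter into the chat route is a single check before any retrieval or LLM work. A sketch, keyed on the caller's IP (swap in a real user ID from your auth layer if you have one):

// app/api/chat/route.ts (excerpt): check the limit first
import { NextResponse } from 'next/server';
import { isRateLimited } from '@/lib/rateLimiter';

export async function POST(request: Request) {
  // Key the limit on the caller's IP when no authenticated user ID is available
  const clientKey = request.headers.get('x-forwarded-for') ?? 'anonymous';

  if (isRateLimited(clientKey, 10)) {
    return NextResponse.json(
      { error: 'Too many requests, please slow down' },
      { status: 429 }
    );
  }

  // ...retrieval and generation continue as in the RAG chat endpoint above
}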

Monitoring and Observability

Key Metrics to Track:

  • Query latency (embedding + retrieval + generation)
  • Retrieval quality (are relevant docs being found?)
  • Cost per query (embedding + LLM tokens)
  • Error rates by query type
  • User feedback on answer quality
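
You don’t need a full observability stack to start. A minimal sketch of the latency side: time each phase inside the chat route and log one structured record per request, which you can later ship to whatever logging or analytics service you use:

// lib/metrics.ts
// Per-request timing record for the RAG pipeline (sketch).

export interface RagMetrics {
  embeddingMs: number;
  retrievalMs: number;
  generationMs: number;
  retrievedCount: number;
  topSimilarity: number | null;
}

// Run an async step and return its result along with elapsed milliseconds
export async function timed<T>(fn: () => Promise<T>): Promise<[T, number]> {
  const start = performance.now();
  const result = await fn();
  return [result, performance.now() - start];
}

export function logMetrics(metrics: RagMetrics): void {
  // Swap console.log for your logging/analytics service of choice
  console.log(JSON.stringify({ type: 'rag_query', ...metrics }));
}

In the chat route, wrap the searchDocuments call and the chat completion call with timed, then call logMetrics before returning the response.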

Common Pitfalls

Pitfall: Chunks Too Large or Too Small

Large chunks waste context window. Small chunks lose meaning.

Solution: Start with 500-1000 tokens per chunk. Test and adjust based on your content.

Pitfall: Ignoring Metadata

Without source tracking, users can’t verify answers.

Solution: Always store and return source documents, page numbers, and timestamps.

Pitfall: Not Handling “I Don’t Know”

LLMs will hallucinate answers if context doesn’t contain relevant info.

Solution: Check similarity scores. If too low, tell the user you don’t have that information.
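
A minimal sketch of that check inside the chat route, using the similarity scores already returned by searchDocuments (the 0.5 cutoff is an arbitrary starting point; tune it against your own data):

// In app/api/chat/route.ts, right after retrieval
const relevantDocs = await searchDocuments(message, 5);

const MIN_SIMILARITY = 0.5; // tune this threshold for your content
const confident = relevantDocs.filter(doc => doc.similarity >= MIN_SIMILARITY);

if (confident.length === 0) {
  return NextResponse.json({
    answer: "I don't have information about that in my knowledge base.",
    sources: [],
  });
}

// Build the context from the confident matches only
const context = confident.map(doc => doc.content).join('\n\n---\n\n');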

Pitfall: Stale Data

Documents change. Your embeddings need to stay current.

Solution: Implement document versioning and scheduled re-ingestion.
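
One simple pattern: tag every chunk with a source identifier in its metadata at ingestion time, then delete and re-insert all chunks for that source whenever the document changes. A sketch, assuming a sourceId metadata field (that field name is an assumption, not part of the earlier schema):

// lib/reingest.ts
// Replace all chunks for a source with a freshly ingested version (sketch).
// Assumes each chunk's metadata includes a sourceId set at ingestion time.
import { createClient } from '@/lib/supabase/server';
import { generateEmbeddings } from '@/lib/embeddings';
import { chunkText } from '@/lib/chunking';

export async function reingestDocument(sourceId: string, content: string) {
  const supabase = await createClient();

  // Remove the previous version's chunks for this source
  const { error: deleteError } = await supabase
    .from('documents')
    .delete()
    .eq('metadata->>sourceId', sourceId);
  if (deleteError) throw deleteError;

  // Re-chunk, re-embed, and insert the new version
  const chunks = chunkText(content);
  const embeddings = await generateEmbeddings(chunks.map(c => c.content));

  const { error: insertError } = await supabase.from('documents').insert(
    chunks.map((chunk, i) => ({
      content: chunk.content,
      metadata: { ...chunk.metadata, sourceId },
      embedding: embeddings[i],
    }))
  );
  if (insertError) throw insertError;
}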

Summary

RAG transforms a generic AI chatbot into a knowledgeable assistant that understands your specific domain. The core concepts are straightforward:

  • Ingest documents by chunking and embedding them
  • Store embeddings in a vector database for fast similarity search
  • Retrieve context at query time based on semantic similarity
  • Generate answers using the LLM with retrieved context
  • Optimize with better chunking, hybrid search, and re-ranking

Implementation details matter. Chunking strategy, embedding model choice, and prompt engineering all affect quality. Start simple, measure results, and iterate.

Need Help Building Your RAG System?

RAG implementation involves many moving parts. If you want expert guidance on building a production-ready RAG system for your Bolt app, we can help.
