Retrieval Augmented Generation (RAG): Build AI with Your Data
ChatGPT knows general information, but it doesn't know about your company's products, your internal processes, or your customer data. When you ask it company-specific questions, it hallucinates or says "I don't have that information."
Retrieval Augmented Generation (RAG) solves this. It lets AI access your data in real-time, answer questions accurately, and cite sources—without expensive model retraining.
What You'll Learn
- What RAG is and how it differs from fine-tuning
- How RAG systems work (retrieval + generation)
- When to use RAG vs alternatives
- Building a simple RAG system
- Real-world applications for business
- Tools and frameworks to get started
What is RAG?
RAG = Retrieval Augmented Generation
Simple explanation: Before answering your question, the AI:
- Searches your documents/database for relevant information
- Includes that information in its prompt
- Generates an answer based on what it found
Why it's powerful: AI gets access to up-to-date, domain-specific information without retraining the model.
RAG vs Fine-Tuning vs Prompt Engineering
Traditional Prompting
How it works: Include information directly in prompt
Prompt: "Based on this document: [paste 50 pages], answer the question: What's our refund policy?"
Limitations:
- ❌ Doesn't scale (context window limits)
- ❌ Manual retrieval and pasting
- ❌ Can't search across hundreds of documents
Best for: One-off questions, small documents
Fine-Tuning
How it works: Retrain model on your data to "memorize" information
Limitations:
- ❌ Expensive ($100s-$1000s per training run)
- ❌ Time-consuming (hours to days)
- ❌ Static (outdated as soon as data changes)
- ❌ No source citation
- ❌ Can hallucinate "remembered" facts
Best for: Teaching model new formats, styles, or tasks—not facts
RAG (Retrieval Augmented Generation)
How it works: AI retrieves relevant information before generating response
Advantages:
- ✅ Access to large knowledge bases
- ✅ Always up-to-date (queries current data)
- ✅ Cites sources
- ✅ Cost-effective
- ✅ Fast to implement
Best for: Question answering over documents, customer support, internal knowledge bases
How RAG Works (Technical Overview)
The 5 Steps
1. Index: Convert documents to vectors (embeddings). Documents → Chunks → Embeddings → Vector Database
2. Query: User asks a question. "What's our enterprise pricing?"
3. Retrieve: Find the most relevant chunks by searching the vector database for similar embeddings.
4. Augment: Add the retrieved info to the prompt. "Based on these docs: [relevant chunks], answer: [question]"
5. Generate: AI produces an answer with citations. "Enterprise pricing starts at $500/month [source: pricing-doc.pdf p.3]"
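Conceptually, steps 4 and 5 are just string assembly plus one model call. Here's a minimal sketch of the augment-and-generate half, assuming the openai Python package (v1+) and a hypothetical retrieve_chunks function that wraps whatever vector search you use:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_with_rag(question, retrieve_chunks):
    # retrieve_chunks(question) is assumed to return a list of (text, source) pairs
    chunks = retrieve_chunks(question)

    # Step 4 (Augment): stuff the retrieved text into the prompt
    context = "\n\n".join(f"[{source}]\n{text}" for text, source in chunks)
    prompt = (
        "Answer the question using only the context below and cite the source in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # Step 5 (Generate): the model answers from the augmented prompt
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat-capable model works here
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```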
Embeddings Explained
Embedding = numerical representation of text meaning
Example:
"dog" → [0.2, 0.8, 0.1, ...] (384 numbers) "puppy" → [0.21, 0.79, 0.11, ...] (very similar numbers) "car" → [0.9, 0.1, 0.05, ...] (very different numbers)
Why useful: Can measure similarity mathematically
- "dog" is similar to "puppy" (close vectors)
- "dog" is different from "car" (distant vectors)
In RAG:
- Documents stored as embeddings
- Query converted to embedding
- Find document embeddings closest to query embedding
- Those are most relevant documents
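To make "close" and "distant" vectors concrete, here is a small sketch using the open-source sentence-transformers library (the model name is just one common default that happens to produce 384-dimensional vectors):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional vectors

words = ["dog", "puppy", "car"]
embeddings = model.encode(words)

# Cosine similarity: close to 1.0 = very similar meaning, near 0 = unrelated
print(util.cos_sim(embeddings[0], embeddings[1]))  # dog vs puppy -> high
print(util.cos_sim(embeddings[0], embeddings[2]))  # dog vs car   -> low
```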
When to Use RAG
✅ Great Use Cases
Customer support knowledge base
- Thousands of support articles
- Policies change frequently
- Need accurate, cited answers
Internal company knowledge
- Employee handbook, policies, procedures
- Technical documentation
- Meeting notes, project docs
Research and analysis
- Scientific papers
- Legal documents
- Market research reports
Compliance and regulation
- Industry regulations
- Company compliance policies
- Audit documentation
Code documentation
- API references
- Codebase explanations
- Technical specifications
❌ Not Ideal For
Creative writing (doesn't need retrieval)
Style learning (fine-tuning better)
Real-time web search (different architecture needed)
Tiny document sets (<10 pages—just use context window)
Highly structured queries (traditional database better)
Building a Simple RAG System
Tools You'll Need
Vector databases (stores embeddings):
- Pinecone (hosted, easy)
- Weaviate (hosted or self-hosted)
- ChromaDB (lightweight, local)
- FAISS (open source, fast)
LLM frameworks:
- LangChain (most popular, Python/JS)
- LlamaIndex (specialized for RAG)
- Haystack (flexible pipeline builder)
Embedding models:
- OpenAI text-embedding-ada-002
- Sentence Transformers (open source)
- Cohere Embeddings
Basic RAG Implementation (Python)
Install dependencies:
```bash
pip install langchain openai chromadb
```
Simple example:
```python
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
import os

# Set OpenAI API key
os.environ["OPENAI_API_KEY"] = "your-api-key"

# Step 1: Load documents
loader = DirectoryLoader('./docs', glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()

# Step 2: Split into chunks (important for context fitting)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200  # Overlap helps maintain context
)
texts = text_splitter.split_documents(documents)

print(f"Split into {len(texts)} chunks")

# Step 3: Create embeddings and store in vector database
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=texts,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

print("✅ Vector database created")

# Step 4: Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),  # Low temp for factual answers
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),  # Retrieve top 3 chunks
    return_source_documents=True
)

# Step 5: Ask questions
def ask_question(question):
    result = qa_chain({"query": question})

    print(f"\n❓ Question: {question}")
    print(f"✅ Answer: {result['result']}")

    print("\n📄 Sources:")
    for doc in result['source_documents']:
        print(f"- {doc.metadata['source']}")

# Usage
ask_question("What is our refund policy?")
ask_question("How do I reset my password?")
ask_question("What are the enterprise pricing tiers?")
```
What this does:
- Loads all text files from the ./docs directory
- Splits them into ~1000-character chunks
- Creates embeddings for each chunk
- Stores in ChromaDB (local vector database)
- When you ask a question:
- Finds 3 most relevant chunks
- Sends to OpenAI with context
- Returns answer with sources
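One practical detail: because the example persists the database to ./chroma_db, you don't need to re-embed your documents on every run. A minimal sketch of reloading the existing store, using the same LangChain classes as above:

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Reload the persisted vector store instead of rebuilding it
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
```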
Advanced RAG Techniques
1. Hybrid Search (Keyword + Semantic)
Combine traditional keyword search with vector similarity:
```python
# Combine BM25 (keyword) with vector search
from langchain.retrievers import BM25Retriever, EnsembleRetriever

keyword_retriever = BM25Retriever.from_documents(documents)
vector_retriever = vectorstore.as_retriever()

ensemble_retriever = EnsembleRetriever(
    retrievers=[keyword_retriever, vector_retriever],
    weights=[0.5, 0.5]  # Equal weight to both
)
```
When to use: When exact keyword matches are important (product codes, names, technical terms)
2. Re-Ranking
Retrieve many candidates, then re-rank with more sophisticated model:
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Retrieve 10, compress to most relevant 3
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)
```
Improves accuracy by using an LLM to filter out irrelevant chunks
3. Metadata Filtering
Filter by document properties before semantic search:
```python
# Only search within specific department docs
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {"department": "engineering"}
    }
)
```
Use cases: Multi-tenant systems, department-specific knowledge, date filtering
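For a filter like the one above to work, the metadata must be attached when documents are indexed. A minimal sketch, where the field names (department, updated) are purely illustrative:

```python
from langchain.schema import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

docs_with_metadata = [
    Document(
        page_content="Deployment runbook for the API service...",
        metadata={"department": "engineering", "updated": "2024-01-15"}
    ),
    Document(
        page_content="Expense reimbursement policy...",
        metadata={"department": "finance", "updated": "2023-11-02"}
    ),
]

# The metadata travels with each chunk into the vector store,
# so it can later be used in search_kwargs={"filter": ...}
vectorstore = Chroma.from_documents(
    documents=docs_with_metadata,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db"
)
```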
4. Query Transformation
Improve retrieval by rewriting user queries:
```python
from langchain.retrievers import MultiQueryRetriever

# Generates 3 variations of user query, retrieves for each
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm
)

# User asks: "How do I change my email?"
# System generates:
# - "Steps to update email address"
# - "Email modification process"
# - "Change account email settings"
```
Improves recall by trying different phrasings
Real-World Applications
Internal Knowledge Base
Example: Employee handbook + policies chatbot
```python
# Load from multiple sources
from langchain.document_loaders import (
    PyPDFLoader,
    UnstructuredWordDocumentLoader,
    WebBaseLoader
)

loaders = [
    PyPDFLoader("handbook.pdf"),
    UnstructuredWordDocumentLoader("policies.docx"),
    WebBaseLoader("https://intranet.company.com/guidelines")
]

documents = []
for loader in loaders:
    documents.extend(loader.load())

# Build RAG system...
# Now employees can ask: "What's the remote work policy?"
```
Customer Support AI
Example: Answer customer questions using help docs
```python
# Add conversation memory for follow-ups
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    memory=memory
)

# Customer: "How do I cancel?"
# AI: "You can cancel by... [based on docs]"
# Customer: "What about refunds?"
# AI: "Regarding refunds after cancellation... [remembers context]"
```
Code Documentation Assistant
Example: Query codebase documentation
```python
# Load code files
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import PythonLoader

loader = DirectoryLoader(
    './src',
    glob="**/*.py",
    loader_cls=PythonLoader
)

docs = loader.load()

# Ask: "How does the authentication module work?"
# AI: Retrieves auth.py, auth_helpers.py, explains based on code
```
Measuring RAG Performance
Key Metrics
Retrieval metrics:
- Precision: % of retrieved docs that are relevant
- Recall: % of relevant docs that were retrieved
- MRR (Mean Reciprocal Rank): Position of first relevant doc
Generation metrics:
- Accuracy: Correctness of answer
- Groundedness: Answer based on retrieved docs (not hallucinated)
- Relevance: Answer addresses the question
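The retrieval metrics are simple enough to compute by hand for each test query, given the sources you expected and the sources actually retrieved. A small framework-free sketch:

```python
def retrieval_metrics(retrieved_sources, relevant_sources):
    """Precision, recall, and reciprocal rank for a single query."""
    relevant = set(relevant_sources)
    hits = [s for s in retrieved_sources if s in relevant]

    precision = len(hits) / len(retrieved_sources) if retrieved_sources else 0.0
    recall = len(set(hits)) / len(relevant) if relevant else 0.0

    # Reciprocal rank: 1 / position of the first relevant document
    reciprocal_rank = 0.0
    for position, source in enumerate(retrieved_sources, start=1):
        if source in relevant:
            reciprocal_rank = 1 / position
            break

    return precision, recall, reciprocal_rank

# Example: expected policies.pdf, retrieved it in second position
print(retrieval_metrics(["faq.md", "policies.pdf", "intro.txt"], ["policies.pdf"]))
# -> (0.333..., 1.0, 0.5)
```

MRR is then the mean of the reciprocal ranks across all test queries.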
Testing Approach
Create test set:
```python
test_cases = [
    {
        "question": "What is our return policy?",
        "expected_answer": "30 days with receipt",
        "expected_sources": ["policies.pdf"]
    },
    # ... more cases
]

for case in test_cases:
    result = qa_chain({"query": case["question"]})
    # Evaluate if answer is correct and sources match
```
Common Pitfalls
❌ Chunks too large: AI loses focus, includes irrelevant info
✅ Keep chunks 500-1500 characters
❌ Chunks too small: Loses context, incomplete information
✅ Use overlap (100-200 chars) to maintain context
❌ Poor document structure: Headings split across chunks
✅ Use semantic splitters that respect document structure
❌ Not citing sources: Users can't verify information
✅ Always return and display source documents
❌ Hallucination: AI makes up answers not in docs
✅ Use low temperature and prompt the model to answer only from the retrieved context (see the prompt sketch after this list)
❌ Outdated data: Vector store not refreshed
✅ Implement a refresh pipeline so the vector store is updated when docs change
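To reduce hallucination, the prompt itself should instruct the model to answer only from the retrieved context. One possible wording, passed to the RetrievalQA chain from the basic example as a custom prompt (the exact phrasing is yours to tune):

```python
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

grounded_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say "
        "\"I don't know based on the available documents.\"\n\n"
        "Context:\n{context}\n\n"
        "Question: {question}\n"
        "Answer:"
    )
)

# vectorstore is the Chroma store built in the basic example
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),  # low temperature for factual answers
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    chain_type_kwargs={"prompt": grounded_prompt},  # inject the grounded prompt
    return_source_documents=True
)
```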
Tools and Platforms
No-code/Low-code:
- Mendable (documentation chatbot)
- ChatBase (custom chatbot builder)
- Glean (enterprise knowledge search)
Developer frameworks:
- LangChain (comprehensive, popular)
- LlamaIndex (RAG-focused)
- Haystack (NLP pipelines)
Vector databases:
- Pinecone (managed, easy)
- Weaviate (open source)
- Qdrant (high performance)
- Milvus (scalable)
Embedding models:
- OpenAI (best quality, paid)
- Cohere (good quality, paid)
- Sentence Transformers (free, open source)
Cost Considerations
Typical RAG system costs:
- Embeddings: $0.0001 per 1K tokens (OpenAI)
  - 1000 documents (~500K tokens) = ~$0.05 one-time
- Vector storage: $0.096 per GB/month (Pinecone)
  - 1000 documents = ~100 MB = ~$0.01/month
- LLM calls: $0.03 per 1K output tokens (GPT-4 Turbo)
  - 1000 queries with 500-token answers = ~$15/month
Total for a 1000-document system with 1000 monthly queries: roughly $15/month, with indexing and storage adding only cents.
Much cheaper than:
- Building custom search engine
- Training custom models
- Hiring human support for every question
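If you want to sanity-check these numbers for your own corpus, the arithmetic fits in a few lines; the prices are the ones quoted above and will drift over time:

```python
def estimate_costs(total_doc_tokens, queries_per_month, answer_tokens=500):
    embedding_cost = total_doc_tokens / 1000 * 0.0001          # one-time indexing
    llm_cost = queries_per_month * answer_tokens / 1000 * 0.03  # output tokens only
    storage_cost = 0.01                                          # roughly, for ~100 MB of vectors
    return embedding_cost, llm_cost + storage_cost

one_time, monthly = estimate_costs(total_doc_tokens=500_000, queries_per_month=1000)
print(f"Indexing: ${one_time:.2f} one-time, ~${monthly:.2f}/month ongoing")
# -> Indexing: $0.05 one-time, ~$15.01/month ongoing
```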
Key Takeaways
- RAG = Retrieval + Generation for accurate, cited answers
- Better than fine-tuning for factual knowledge
- Always up-to-date since it queries current data
- Embeddings + vector search enable semantic retrieval
- Multiple techniques improve accuracy (hybrid search, re-ranking)
- Real-world applications in support, knowledge management, code docs
- Cost-effective compared to alternatives
Conclusion
RAG democratizes AI for business applications. You don't need massive budgets or ML PhDs to build AI that knows your company's data. With frameworks like LangChain and vector databases like Pinecone, you can build a working RAG system in an afternoon.
Start small: one use case, one set of documents. Get it working, measure accuracy, refine. Then expand to other knowledge bases. Soon you'll have AI assistants that actually know your business—and cite their sources.
Your company's knowledge just became AI-accessible.
Related articles: GPT-4 vs Claude 3: Which AI for Work, Context Windows in AI: Why Size Matters