Build a RAG System with Python and a Local LLM (No API Costs)

RAG (Retrieval-Augmented Generation) is one of the most in-demand LLM skills in 2026. This tutorial provides a complete hands-on guide to building a RAG system using only Python and a local LLM — no paid APIs required.

It covers document chunking and vectorization, local embedding-model selection, retrieval logic, and how to integrate everything into a working system. Ideal for developers learning RAG in a local environment.


RAG (Retrieval-Augmented Generation) is one of the most in-demand LLM skills in 2026. Every company wants to point an AI at their docs, their codebase, their knowledge base — and get useful answers back.

The typical stack involves OpenAI embeddings + GPT-4 + a vector DB. The typical bill involves a credit card.

Here's how to build the same thing entirely on local hardware: Python + Ollama + ChromaDB. No API keys. No per-token costs. Runs on a laptop or a home server.

The system:

- Ingests documents (text files, markdown, PDFs)
- Embeds them using a local model
- Stores vectors in ChromaDB (local, in-memory or persistent)
- Retrieves relevant chunks on query
- Generates an answer using a local LLM via Ollama

Total cloud cost: $0.
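Before wiring in a vector DB, it helps to see the retrieval step (step 4 above) in miniature. The sketch below is my illustration, not part of the stack we build — `retrieve_top_k` is a hypothetical helper, and ChromaDB performs this search for us at scale:

```python
def retrieve_top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return indices of the k document vectors most similar to the query.

    Brute-force dot-product scan (assumes normalized vectors) -- fine for a
    handful of vectors, which is exactly what a vector DB replaces at scale.
    """
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

That's the whole idea: embed the query, score it against every stored chunk, keep the best matches. Everything else in the stack exists to do this fast and persistently.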

Ollama installed with at least one model pulled

8 GB RAM minimum (16 GB recommended for 14B models)

Install dependencies:

```bash
pip install chromadb ollama requests
```

Pull models — one for embeddings, one for generation:

```bash
ollama pull nomic-embed-text   # Fast, purpose-built embedding model
ollama pull qwen2.5:14b        # Generation model
```


Step 1: Document Ingestion

```python
import glob
import os
from pathlib import Path

def load_documents(docs_dir: str) -> list[dict]:
    """Load text documents from a directory.

    Returns a list of {content, source, chunk_id} dicts.
    """
    documents = []
    patterns = ['**/*.txt', '**/*.md', '**/*.py', '**/*.rst']
    for pattern in patterns:
        for filepath in glob.glob(os.path.join(docs_dir, pattern), recursive=True):
            try:
                with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
                    content = f.read()
                if len(content.strip()) < 50:
                    continue  # Skip tiny files
                chunks = chunk_text(content, chunk_size=500, overlap=50)
                for i, chunk in enumerate(chunks):
                    documents.append({
                        'content': chunk,
                        'source': filepath,
                        'chunk_id': f"{Path(filepath).stem}_{i}",
                    })
            except Exception as e:
                print(f"[warn] Skipping {filepath}: {e}")
    print(f"[ingest] Loaded {len(documents)} chunks from {docs_dir}")
    return documents

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by word count."""
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)
        i += chunk_size - overlap  # Slide with overlap
    return chunks
```
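To see how the overlap arithmetic behaves, here's the same sliding-window logic run with toy sizes (my scaled-down numbers, not the tutorial's defaults):

```python
# Same windowing as chunk_text, with toy sizes: the start index
# advances by chunk_size - overlap = 3 words per step.
words = "one two three four five six seven eight nine ten".split()
chunk_size, overlap = 5, 2
chunks, i = [], 0
while i < len(words):
    chunks.append(' '.join(words[i:i + chunk_size]))
    i += chunk_size - overlap

print(chunks)
```

Each chunk repeats the last `overlap` words of the previous one, so a sentence that straddles a boundary stays retrievable from either side.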


Step 2: Local Embeddings with Ollama

nomic-embed-text is a purpose-built embedding model — fast, small (274M params), and genuinely good at semantic similarity.
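A quick way to sanity-check any embedding model is to compare its vectors directly. Cosine similarity is the standard metric; this small helper is my addition (not part of the tutorial's pipeline) and works on the raw float lists Ollama returns:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Embed two paraphrases (say, "how do I reset my password" and "password reset steps") and an unrelated sentence; a good embedding model should score the paraphrase pair noticeably higher.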

```python
import ollama

def embed_texts(texts: list[str], model: str = "nomic-embed-text") -> list[list[float]]:
    """Generate embeddings for a list of texts using Ollama.

    Returns a list of embedding vectors.
    """
    embeddings = []
    for i, text in enumerate(texts):
        print(f"[embed] Processing chunk {i}/{len(texts)}...")
        response = ollama.embeddings(model=model, prompt=text)
        embeddings.append(response['embedding'])
    return embeddings
```


Step 3: Vector Storage with ChromaDB

```python
import chromadb

def build_vector_store(
    documents: list[dict],
    embeddings: list[list[float]],
    collection_name: str = "local_rag",
    persist_dir: str = "./chroma_db",
):
    """Persist chunks and their embeddings in a local ChromaDB collection."""
    client = chromadb.PersistentClient(path=persist_dir)
    collection = client.get_or_create_collection(name=collection_name)
    collection.add(
        ids=[doc['chunk_id'] for doc in documents],
        embeddings=embeddings,
        documents=[doc['content'] for doc in documents],
        metadatas=[{'source': doc['source']} for doc in documents],
    )
    return collection
```