Build a RAG System with Python and a Local LLM (No API Costs) - DEV Community
RAG (Retrieval-Augmented Generation) is one of the most in-demand LLM skills in 2026. Every company wants to point an AI at their docs, their codebase, their knowledge base — and get useful answers back.
The typical stack involves OpenAI embeddings + GPT-4 + a vector DB. The typical bill involves a credit card.
Here's how to build the same thing entirely on local hardware: Python + Ollama + ChromaDB. No API keys. No per-token costs. Runs on a laptop or a home server.
- Ingests documents (text files, markdown, PDFs)
- Embeds them using a local model
- Stores vectors in ChromaDB (local, in-memory or persistent)
- Retrieves relevant chunks on query
- Generates an answer using a local LLM via Ollama
Total cloud cost: $0.
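Before the real code, here's a hedged sketch of how those steps wire together. Everything here is a toy stand-in — `chunk`, `fake_embed` (a bag-of-vowels counter), and a plain list in place of ChromaDB — so the data flow is visible without Ollama running:

```python
# Hypothetical end-to-end wiring of the pipeline above, with every
# real component replaced by a tiny stand-in.

def chunk(text: str, size: int = 5) -> list[str]:
    """Split text into fixed-size word chunks (no overlap, for brevity)."""
    words = text.split()
    return [' '.join(words[i:i + size]) for i in range(0, len(words), size)]

def fake_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: counts of each vowel."""
    return [text.count(c) for c in 'aeiou']

def retrieve(query: str, store: list[tuple[list[float], str]], k: int = 1) -> list[str]:
    """Return the k chunks whose fake embeddings are closest to the query's."""
    q = fake_embed(query)
    scored = sorted(store, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], q)))
    return [text for _, text in scored[:k]]

doc = "ollama runs local models and chromadb stores the vectors on disk"
store = [(fake_embed(c), c) for c in chunk(doc)]
print(retrieve("where are vectors stored", store))
# → ['chromadb stores the vectors on']
```

Even this crude similarity measure pulls back the right chunk; the real system swaps in a proper embedding model and vector index but keeps the same shape.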
Prerequisites:
- Ollama installed with at least one model pulled
- 8 GB RAM minimum (16 GB recommended for 14B models)
Install dependencies:

```shell
pip install chromadb ollama requests
```

Then pull two models, one for embeddings and one for generation:

```shell
ollama pull nomic-embed-text   # Fast, purpose-built embedding model
ollama pull qwen2.5:14b        # Generation model
```
Step 1: Document Ingestion
```python
import glob
import os
from pathlib import Path

def load_documents(docs_dir: str) -> list[dict]:
    """Load text documents from a directory.

    Returns a list of {content, source, chunk_id} dicts.
    """
    documents = []
    patterns = ['**/*.txt', '**/*.md', '**/*.py', '**/*.rst']
    for pattern in patterns:
        for filepath in glob.glob(os.path.join(docs_dir, pattern), recursive=True):
            try:
                with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
                    content = f.read()
                if len(content.strip()) < 50:
                    continue  # Skip tiny files
                chunks = chunk_text(content, chunk_size=500, overlap=50)
                for i, chunk in enumerate(chunks):
                    documents.append({
                        'content': chunk,
                        'source': filepath,
                        'chunk_id': f"{Path(filepath).stem}_{i}",
                    })
            except Exception as e:
                print(f"[warn] Skipping {filepath}: {e}")
    print(f"[ingest] Loaded {len(documents)} chunks from {docs_dir}")
    return documents

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by word count."""
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)
        i += chunk_size - overlap  # Slide with overlap
    return chunks
```
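To see what the `chunk_size`/`overlap` parameters mean in practice, here's the same sliding-window logic run standalone on a toy input (the function is reproduced so this snippet is self-contained; the 4/2 values are just for illustration, the article's defaults are 500/50):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by word count (same logic as above)."""
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunks.append(' '.join(words[i:i + chunk_size]))
        i += chunk_size - overlap  # stride = chunk_size - overlap words
    return chunks

# 10 words, windows of 4, overlap of 2 → stride of 2.
toy = ' '.join(f"w{n}" for n in range(10))
for c in chunk_text(toy, chunk_size=4, overlap=2):
    print(c)
# → w0 w1 w2 w3
#   w2 w3 w4 w5
#   w4 w5 w6 w7
#   w6 w7 w8 w9
#   w8 w9
```

The overlap means each chunk repeats the tail of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.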
Step 2: Local Embeddings with Ollama
nomic-embed-text is a purpose-built embedding model — fast, small (274M params), and genuinely good at semantic similarity.
```python
import ollama

def embed_texts(texts: list[str], model: str = "nomic-embed-text") -> list[list[float]]:
    """Generate embeddings for a list of texts using Ollama.

    Returns a list of embedding vectors.
    """
    embeddings = []
    for i, text in enumerate(texts):
        print(f"[embed] Processing chunk {i}/{len(texts)}...")
        response = ollama.embeddings(model=model, prompt=text)
        embeddings.append(response['embedding'])
    return embeddings
```
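ChromaDB will handle nearest-neighbour search in the next step, but the scoring underneath is plain vector similarity. As a mental model, here's cosine similarity over two toy vectors — pure Python, no Ollama needed, and the vectors are made up for illustration:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: ~1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # ≈ 1.0 (identical)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # ≈ 0.0 (orthogonal)
```

Embeddings of semantically similar text end up with high cosine similarity; that's the property `nomic-embed-text` is trained for and the one retrieval relies on.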
Step 3: Vector Storage with ChromaDB
import chromadb
from chromadb.config import Settings
def build_vector_store(
documents: list[dict],
embeddings: list[list[float]],
collection_name: str = "local_rag",
persist_dir: str = "./chroma_db"