Pathway Deep Dive: Real-Time Stream Processing and RAG with Python API and Rust Engine

Pathway is a unique stream processing framework with 60k+ GitHub stars: write business logic in Python, run it on a Rust engine. Built on Differential Dataflow for incremental computation—processing only data changes, making it naturally suited for real-time scenarios. The architecture centers on a declarative paradigm (define-then-run): developers define a complete computation graph in Python, and the system performs global optimization before executing at pw.run(). In-memory processing plus stateful operations (join/window/sort) ensure low latency and high throughput. RAG support is a standout feature: built-in real-time vector indexing automatically updates embeddings incrementally when documents change, eliminating the need for a separate vector database. Supports 350+ data source connectors (Kafka, PostgreSQL, SharePoint, GDrive, and more).

Why Pathway Deserves Serious Attention

The real-time data processing landscape has long been dominated by Apache Flink and Spark Streaming. Pathway challenges this status quo with a fundamentally different approach: write logic in Python, run it on a Rust engine, and leverage Differential Dataflow for incremental computation. With over 60,000 GitHub stars as of early 2026, Pathway has become one of the fastest-growing data processing frameworks in recent years.

This deep dive explores Pathway's core architecture, the engineering trade-offs behind its design, how it compares to Flink and Spark, and its unique advantages for RAG (Retrieval-Augmented Generation) and AI pipeline scenarios.

---

Core Architecture: Rust Engine + Differential Dataflow

The Two-Layer Design

Pathway employs a clean frontend/backend separation architecture:

  • **Frontend (Python API)**: Developers write data transformations, aggregations, and joins using familiar Python syntax. Pathway's Table API resembles pandas but carries full streaming semantics.
  • **Backend (Rust Engine)**: All Python code is ultimately compiled into a Rust execution plan, which the Rust engine schedules and executes.

The critical insight here is that Python is no longer a performance bottleneck. Traditional Python data processing is constrained by the GIL (Global Interpreter Lock), preventing true parallelism. Pathway sidesteps this entirely—Python is only used to *define* the computation graph, while actual execution happens in GIL-free Rust with native support for multithreading, multiprocessing, and distributed computation.

Differential Dataflow: The Mathematical Foundation of Incremental Computation

Differential Dataflow is the most critical technical component underpinning Pathway, developed by Frank McSherry at Microsoft Research as part of the Naiad project. Its core insight:

> Instead of recomputing the entire result set, propagate and process only the *differences* in data.

Concretely, each data record carries a timestamp and a weight (+1 for insertion, -1 for deletion). When the data stream changes, the system processes only the "delta" and propagates those changes through the computation graph until all downstream operators' outputs are updated accordingly.

What does this mean for stream processing?

  • **Native out-of-order handling**: Differential Dataflow's time model allows data to arrive in any order; the system automatically applies corrections within the appropriate time windows.
  • **Consistent stateful operations**: Intermediate state for joins, groupbys, and windows always remains consistent, without complex checkpoint machinery.
  • **Memory efficiency**: Only incremental state is stored, not full state snapshots.
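The weighted-delta model can be sketched in a few lines of plain Python (illustrative only, not Pathway's actual engine): each change is a `(key, weight)` pair, and an aggregate is maintained by touching only the affected keys, never the full result set.

```python
from collections import defaultdict

def apply_deltas(counts, deltas):
    """Update a grouped count in place from (key, weight) deltas.

    Each delta carries +1 (insertion) or -1 (deletion); only the
    affected keys are updated, never the whole result set.
    """
    for key, weight in deltas:
        counts[key] += weight
        if counts[key] == 0:
            del counts[key]  # retract keys whose count drops to zero
    return counts

counts = defaultdict(int)
apply_deltas(counts, [("eu", +1), ("us", +1), ("eu", +1)])
assert dict(counts) == {"eu": 2, "us": 1}

# A late correction arrives: one "eu" record is retracted.
apply_deltas(counts, [("eu", -1)])
assert dict(counts) == {"eu": 1, "us": 1}
```

The retraction at the end is the key idea: a late or corrected record is just another delta flowing through the same code path, which is why out-of-order data needs no special handling.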

The Declarative Paradigm: Define-then-Run

Pathway uses a `define-then-run` execution model, similar in spirit to TensorFlow's graph mode or Spark's lazy evaluation:

```python
import pathway as pw

# Step 1: Define data sources (no data flows yet)
orders = pw.io.kafka.read(
    rdkafka_settings={"bootstrap.servers": "kafka:9092"},
    topic="orders",
    schema=OrderSchema,
)
# (A `products` table would be defined similarly from another source.)

# Step 2: Define transformations (build computation graph, still no execution)
enriched = orders.join(
    products, pw.left.product_id == pw.right.id
).select(
    order_id=pw.left.id,
    product_name=pw.right.name,
    amount=pw.left.quantity * pw.right.price,
)

# Step 3: Define output sink
pw.io.postgres.write(enriched, pg_settings, "enriched_orders")

# Step 4: Trigger execution (global optimization, then launch)
pw.run()
```

All operations before `pw.run()` simply construct the computation DAG (Directed Acyclic Graph). When `pw.run()` is called, the Pathway engine performs global optimizations (operator fusion, plan reordering, etc.) before launching the Rust execution engine. This resembles Spark's lazy evaluation philosophy but differs fundamentally at the execution layer.
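A toy illustration of the same idea (plain Python, not Pathway's internals): each operation only appends a step to a plan, and nothing executes until `run()` is called, at which point the whole plan is available for optimization.

```python
class LazyGraph:
    """Minimal define-then-run sketch: defining operations records a
    plan; run() executes the entire plan at once."""

    def __init__(self):
        self.plan = []

    def map(self, fn):
        self.plan.append(fn)  # record the step, do not execute it
        return self

    def run(self, data):
        # A real engine would optimize the plan here (operator fusion,
        # reordering) before execution, since it sees all steps at once.
        for fn in self.plan:
            data = [fn(x) for x in data]
        return data

g = LazyGraph().map(lambda x: x + 1).map(lambda x: x * 2)
assert len(g.plan) == 2          # graph defined, nothing run yet
assert g.run([1, 2, 3]) == [4, 6, 8]
```

Seeing the full plan before execution is what makes whole-graph optimizations possible; an eager model would have already executed step 1 before learning about step 2.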

---

Head-to-Head: Pathway vs. Flink vs. Spark

| Dimension | Pathway | Apache Flink | Spark Streaming |
|-----------|---------|--------------|-----------------|
| Programming Language | Python (Rust execution layer) | Java/Scala/Python | Python/Scala/R |
| Execution Engine | Rust + Differential Dataflow | JVM | JVM |
| Incremental Computation | Native (core feature) | Manual state management | Micro-batch (not truly incremental) |
| Out-of-Order Handling | Automatic (mathematical guarantee) | Watermark mechanism | Batch boundary handling |
| Deployment Complexity | Low (single process/Docker/K8s) | High (cluster required) | High (cluster required) |
| RAG/LLM Support | Native, built-in | Third-party integration | Third-party integration |
| Consistency Model | At-least-once (free) / Exactly-once (enterprise) | Exactly-once | Exactly-once |
| Unified Batch & Streaming | Fully unified (same API) | Partially (DataStream + Table API) | Requires API switching |
| Learning Curve | Low (Python-native) | High | Medium |
| Ecosystem Maturity | Growing | Very mature | Very mature |

Pathway shines when: you're a Python-native team building AI/ML pipelines, need rapid development cycles, or are building small-to-medium real-time applications without a dedicated infrastructure team.

Flink wins at: massive-scale production environments, complex CEP (Complex Event Processing), and scenarios requiring bulletproof Exactly-once guarantees with extensive operational tooling.

Spark Streaming is best for: organizations already invested in the Spark ecosystem, unified batch/streaming analytics workloads, and teams with existing Scala/Java expertise.

---

Real-Time RAG: Pathway's Killer Feature

The Traditional RAG Problem

In conventional RAG architectures, the document update pipeline typically looks like:

Document changes → Trigger batch rebuild → Re-generate all embeddings → Write to vector DB

This approach has fundamental limitations: high latency (minutes to hours), massive resource consumption (full recomputation), and poor freshness guarantees.

Pathway's Solution: Incremental Vector Indexing

Pathway's LLM xpack ships a `VectorStoreServer` that provides real-time incremental vector indexing:

```python
import pathway as pw
from pathway.xpacks.llm.vector_store import VectorStoreServer
from pathway.xpacks.llm.embedders import OpenAIEmbedder
from pathway.xpacks.llm.parsers import UnstructuredParser

# Read documents (supports Google Drive, SharePoint, local files, S3, etc.)
documents = pw.io.gdrive.read(
    object_id="your-folder-id",
    service_user_credentials_file="credentials.json",
    mode="streaming",
)

# Build a real-time vector index; the server parses and chunks
# documents internally using the supplied parser
vector_server = VectorStoreServer(
    documents,
    embedder=OpenAIEmbedder(model="text-embedding-3-small"),
    parser=UnstructuredParser(chunking_mode="elements"),
)

# Start server (simultaneously monitors document changes + serves queries)
vector_server.run_server(host="0.0.0.0", port=8666)
```

When a file in Google Drive changes:

1. Pathway's streaming connector detects the modification

2. Only the changed documents are re-parsed and re-embedded

3. The vector index is updated incrementally (Differential Dataflow guarantees consistency)

4. Subsequent queries immediately hit the fresh content

The entire process **requires no full index rebuild**, reducing latency from minutes to seconds.
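The incremental behavior can be illustrated with a toy index (plain Python, with a stub standing in for the embedding model): when a document is upserted, its embedding is recomputed only if its content actually changed.

```python
class IncrementalIndex:
    """Toy vector index: re-embeds only documents whose content changed."""

    def __init__(self, embed):
        self.embed = embed     # embedding function (a stub here)
        self.docs = {}         # doc_id -> current content
        self.vectors = {}      # doc_id -> embedding
        self.embed_calls = 0   # how many embeddings we computed

    def upsert(self, doc_id, content):
        if self.docs.get(doc_id) == content:
            return             # unchanged: nothing to recompute
        self.docs[doc_id] = content
        self.vectors[doc_id] = self.embed(content)
        self.embed_calls += 1

index = IncrementalIndex(embed=lambda text: [float(len(text))])
index.upsert("a", "hello")
index.upsert("b", "world!")
index.upsert("a", "hello")        # unchanged -> no re-embedding
index.upsert("a", "hello again")  # changed -> only "a" is re-embedded
assert index.embed_calls == 3
```

A batch rebuild would have paid four embedding calls per pass over the corpus; the incremental version pays exactly one per changed document, which is where the minutes-to-seconds latency reduction comes from.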

Supported RAG Patterns

Pathway's LLM xpack ships with ready-to-run templates for multiple RAG patterns:

  • **Standard RAG**: Document retrieval → LLM generation, the classic approach optimized with real-time indexing
  • **Adaptive RAG**: Automatically selects retrieval strategy based on query complexity—simple queries skip retrieval, complex ones use multi-hop retrieval
  • **Multimodal RAG**: Supports GPT-4V and other multimodal models for documents containing images, charts, and tables
  • **Private RAG**: Fully local deployment using Ollama + Mistral AI—zero data leaves your infrastructure

Real-Time Unstructured Data to SQL

Another powerful pattern is transforming unstructured documents into structured SQL tables on-the-fly:

```python
# LLM extracts structured data from documents as they stream in.
# `extract_json_from_text` stands in for a user-defined UDF wrapping
# an LLM call with a JSON-schema prompt.
structured = documents.select(
    extracted=extract_json_from_text(
        pw.this.content,
        schema={"product_name": str, "price": float, "quantity": int},
    )
)
pw.io.postgres.write(structured, pg_settings, "structured_products")
```

---

The 350+ Connector Ecosystem

Pathway ships with comprehensive connectors covering virtually every modern data source:

  • **Message Queues**: Apache Kafka, Redpanda, AWS Kinesis, Google Pub/Sub, Azure Event Hubs
  • **Databases**: PostgreSQL (streaming CDC), MySQL, MongoDB (via Airbyte)
  • **Cloud Storage**: AWS S3, Google Cloud Storage, Azure Blob Storage
  • **Collaboration Tools**: Google Drive (real-time change detection), SharePoint (license required)
  • **File Systems**: Local filesystem, SFTP, FTP (with streaming mode)
  • **APIs/Network**: HTTP polling connectors, WebSocket
  • **Via Airbyte Integration**: 300+ additional sources (Salesforce, HubSpot, Zendesk, etc.)

The Airbyte integration deserves special mention. With a single configuration block, Pathway connects to any of Airbyte's 300+ sources and handles CDC automatically:

```python
# Connect to any of Airbyte's 300+ data sources
data = pw.io.airbyte.read(
    source_config={
        "sourceType": "salesforce",
        "client_id": "your-client-id",
        "client_secret": "your-client-secret",
        "refresh_token": "your-refresh-token",
    },
    streams=["Account", "Opportunity", "Lead"],
)
```

---

Production Deployment

Single Process (Development & Small Scale)

For development and small-to-medium production workloads, Pathway runs as a single Python process:

```bash
pip install pathway
python my_pipeline.py
```

The Rust engine handles multithreading internally, so a single process can saturate modern multi-core CPUs.

Docker Deployment

Pathway provides official Docker images with all dependencies pre-bundled:

```dockerfile
FROM pathwaycom/pathway:latest
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "pipeline.py"]
```

Kubernetes Deployment

For large-scale deployments, Pathway integrates with Kubernetes and supports OpenTelemetry for observability:

  • Horizontal scaling via multiple processes/nodes
  • Built-in Prometheus metrics endpoint
  • Persistent state support for fast recovery after failures
  • Full compatibility with existing K8s monitoring stacks (Grafana, Datadog, etc.)

Persistence and Fault Tolerance

Pathway's persistence API snapshots computation state to disk:

```python
pw.run(
    persistence_config=pw.persistence.Config(
        backend=pw.persistence.Backend.filesystem("/data/state"),
        persistence_mode=pw.PersistenceMode.PERSISTING,
        snapshot_interval_ms=30000,  # snapshot every 30 seconds
    )
)
```

After a restart, Pathway recovers from the latest snapshot, avoiding full stream replay. The enterprise version additionally provides Exactly-once semantics, ensuring no data is processed twice even after failure recovery.
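The recovery model can be sketched generically (plain Python, not Pathway's snapshot format): state is periodically persisted together with the stream offset it reflects, and on restart the pipeline replays only the events after that offset.

```python
import json
import os
import tempfile

def snapshot(state, offset, path):
    # Persist operator state together with the stream offset it reflects.
    with open(path, "w") as f:
        json.dump({"state": state, "offset": offset}, f)

def recover(path):
    # On restart, resume from the snapshot instead of replaying from 0.
    with open(path) as f:
        snap = json.load(f)
    return snap["state"], snap["offset"]

events = [("eu", 1), ("us", 1), ("eu", 1), ("us", 1)]
path = os.path.join(tempfile.mkdtemp(), "snap.json")

counts = {}
for key, n in events[:2]:              # process half of the stream...
    counts[key] = counts.get(key, 0) + n
snapshot(counts, offset=2, path=path)  # ...then snapshot state + offset

state, offset = recover(path)          # simulated restart
for key, n in events[offset:]:         # replay only the tail
    state[key] = state.get(key, 0) + n
assert state == {"eu": 2, "us": 2}
```

The point of the offset is that recovery cost is proportional to the tail of the stream since the last snapshot, not to the total history.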

---

Licensing Model

Pathway uses a **Business Source License (BSL)**:

| Feature | Free Version | Enterprise Version |
|---------|--------------|--------------------|
| Core stream processing | ✅ | ✅ |
| 350+ connectors | ✅ (most) | ✅ (all) |
| Consistency | At-least-once | Exactly-once |
| SharePoint connector | ❌ | ✅ |
| Advanced monitoring | Limited | Full dashboard |
| SLA support | Community | Enterprise SLA |

Obtaining a free license key requires only a simple registration on pathway.com, unlocking additional monitoring and analytics features.

---

Who Should Use Pathway?

Pathway is an excellent fit for:

  • **AI/ML engineers** building LLM applications that need fresh data (RAG, agents, real-time recommendations)
  • **Python-native data teams** who want production-grade streaming without learning Java/Scala
  • **Startups and scale-ups** that can't afford dedicated Flink/Spark infrastructure teams
  • **Organizations with diverse data sources** that benefit from Pathway's extensive connector ecosystem

Consider sticking with Flink/Spark if:

  • You're already deeply invested in the JVM streaming ecosystem
  • You need Exactly-once guarantees at petabyte scale (enterprise Pathway is an option too)
  • Your team has strong Java/Scala expertise

---

Conclusion

Pathway represents a new direction for stream processing: **enabling AI/ML engineers to build production-grade real-time pipelines using familiar Python paradigms without sacrificing performance**.

Its five core value propositions:

1. **Python-native, no JVM overhead**: Eliminate the Java/Scala learning curve

2. **Differential Dataflow incremental computation**: Elegantly solve out-of-order and state consistency challenges

3. **Fully unified batch and streaming**: Test on batch data in development, flip to streaming in production with zero code changes

4. **Native RAG integration**: Keep LLM applications grounded in always-fresh data

5. **350+ connectors**: Connect to any data source in a few lines of code

For Python-ecosystem data engineers and AI engineers, Pathway is worth serious evaluation—especially when building real-time AI pipelines without wanting to introduce JVM complexity.