Cortex

Self-hosted RAG engine with hybrid semantic + lexical retrieval

Tech stack: Python · FastAPI · Next.js 15 · React 19 · OpenAI API · FAISS · SQLite · Docker · AWS · TypeScript · Tailwind CSS
Architecture diagram: hybrid retrieval combining semantic search (FAISS vectors), BM25 lexical search, and Reciprocal Rank Fusion (RRF) across multi-format documents.

Overview

Cortex is a self-hosted Retrieval-Augmented Generation (RAG) system that enables semantic understanding and querying across multi-format documents. It combines intelligent memory management, hybrid retrieval strategies, and confidence scoring to deliver accurate, context-aware responses.

Problem

Organizations struggle to extract insights from large document collections. Traditional search methods fail to understand semantic meaning, and manual document analysis is time-consuming and inconsistent. Teams spend hours reviewing documents to find relevant information, leading to inefficient workflows and missed insights.

Solution

Cortex implements a production-ready RAG architecture with three core innovations:

  1. Intelligent Memory System: Maintains context-aware sessions with conversation history, allowing for multi-turn interactions that build on previous queries.

  2. Hybrid Retrieval: Combines Semantic Search (FAISS vectors), BM25 Lexical Search, and Reciprocal Rank Fusion (RRF) to maximize retrieval accuracy across different query types.

  3. Confidence Scoring: Provides visual confidence badges and source attribution, helping users understand the reliability of generated answers.
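The fusion step in the hybrid retrieval above can be sketched as standard Reciprocal Rank Fusion: each retriever contributes `1/(k + rank)` per document, and the summed scores produce the final ordering. This is a minimal illustration (the document IDs and `k = 60` constant are illustrative, not taken from Cortex's code):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple best-first ranked lists of doc IDs into one ranking.

    k is the conventional RRF damping constant; larger k flattens the
    difference between top-ranked and lower-ranked documents.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d3", "d1", "d2"]   # e.g. FAISS nearest-neighbour order
lexical = ["d1", "d4", "d3"]    # e.g. BM25 order
fused = reciprocal_rank_fusion([semantic, lexical])
# d1 wins: it ranks highly in both lists, even though neither
# retriever placed it first.
```

Because RRF only uses ranks, it sidesteps the problem of comparing FAISS cosine distances with BM25 scores, which live on incompatible scales.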

Architecture

The system follows a modular architecture:

  • Backend: FastAPI serves as the API layer, handling document processing, embedding generation, and LLM interactions
  • Vector Store: FAISS enables fast similarity search across document embeddings
  • Database: SQLite stores session metadata and conversation history
  • Frontend: Next.js 15 with React 19 provides a responsive, modern interface
  • Deployment: Docker Compose orchestrates services with Nginx as a reverse proxy on AWS
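The SQLite layer described above can be sketched with a minimal schema for session metadata and per-turn history. The table and column names here are illustrative assumptions, not Cortex's actual schema:

```python
import sqlite3

# Hypothetical schema: one row per session, one row per conversation turn.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (
    id TEXT PRIMARY KEY,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE messages (
    session_id TEXT REFERENCES sessions(id),
    role TEXT CHECK (role IN ('user', 'assistant')),
    content TEXT,
    turn INTEGER
);
""")

conn.execute("INSERT INTO sessions (id) VALUES (?)", ("s1",))
conn.execute(
    "INSERT INTO messages VALUES (?, ?, ?, ?)",
    ("s1", "user", "What does the contract say about renewal?", 1),
)

# Rebuild recent context for a session: newest turns first, capped at 5.
rows = conn.execute(
    "SELECT role, content FROM messages"
    " WHERE session_id = ? ORDER BY turn DESC LIMIT 5",
    ("s1",),
).fetchall()
```

Capping the query at the last few turns is one way to keep prompts bounded while still giving the LLM conversational context.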

Technical Breakdown

Key Technologies

  • Python for backend processing and AI integration
  • FastAPI for high-performance API endpoints
  • FAISS for efficient vector similarity search
  • OpenAI API for embeddings and LLM inference
  • Next.js 15 with React 19 for the frontend
  • Docker for containerization and deployment
  • AWS EC2 for hosting

Challenges Solved

  1. Document Chunking: Implemented intelligent chunking that preserves context while respecting token limits
  2. Hybrid Search: Balanced semantic and lexical retrieval using RRF to handle both conceptual and keyword queries
  3. Memory Management: Built session-based memory that maintains conversation context without storing full histories
  4. Confidence Scoring: Developed a scoring mechanism that evaluates retrieval quality and answer relevance
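The chunking challenge above (preserving context while respecting token limits) is commonly handled with overlapping windows. A simplified sketch, using whitespace words as a stand-in for real tokens and illustrative window sizes:

```python
def chunk_text(text, max_tokens=200, overlap=40):
    """Split text into overlapping word-window chunks.

    Words approximate tokens here; the overlap region repeats the tail
    of each chunk at the head of the next, so sentences that straddle a
    boundary still appear intact in at least one chunk.
    """
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(500))
chunks = chunk_text(doc)
# 500 words with a 200-word window and 160-word step -> 3 chunks,
# each consecutive pair sharing a 40-word overlap.
```

Production systems typically measure chunks with the embedding model's own tokenizer rather than word counts, but the window-plus-overlap structure is the same.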

Results

  • 95%+ accuracy in document retrieval and question answering
  • Time reduction: Document analysis time reduced from hours to minutes
  • Scalable architecture: Handles large document collections with efficient vector search
  • Production-ready: Containerized deployment enables easy setup and maintenance

What I Learned

Building Cortex taught me the importance of balancing semantic and lexical search strategies. The hybrid approach significantly outperformed either method alone. Additionally, implementing session-based memory required careful design to balance context retention with performance.

Next Steps

Future enhancements could include:

  • Support for additional document formats (Markdown, CSV)
  • Multi-tenant architecture for enterprise deployments
  • Advanced caching strategies for faster response times
  • Integration with more LLM providers for flexibility