Cortex
Self-hosted RAG engine with hybrid semantic + lexical retrieval

Cortex: Production-Ready RAG System
Overview
Cortex is a self-hosted Retrieval Augmented Generation (RAG) system that enables semantic understanding and querying across multi-format documents. It combines intelligent memory management, hybrid retrieval strategies, and confidence scoring to deliver accurate, context-aware responses.
Problem
Organizations struggle to extract insights from large document collections. Traditional search methods fail to understand semantic meaning, and manual document analysis is time-consuming and inconsistent. Teams spend hours reviewing documents to find relevant information, leading to inefficient workflows and missed insights.
Solution
Cortex implements a production-ready RAG architecture with three core innovations:
-
Intelligent Memory System: Maintains context-aware sessions with conversation history, allowing for multi-turn interactions that build on previous queries.
-
Hybrid Retrieval: Combines Semantic Search (FAISS vectors), BM25 Lexical Search, and Reciprocal Rank Fusion (RRF) to maximize retrieval accuracy across different query types.
-
Confidence Scoring: Provides visual confidence badges and source attribution, helping users understand the reliability of generated answers.
Architecture
The system follows a modular architecture:
- Backend: FastAPI serves as the API layer, handling document processing, embedding generation, and LLM interactions
- Vector Store: FAISS enables fast similarity search across document embeddings
- Database: SQLite stores session metadata and conversation history
- Frontend: Next.js 15 with React 19 provides a responsive, modern interface
- Deployment: Docker Compose orchestrates services with Nginx as a reverse proxy on AWS
Technical Breakdown
Key Technologies
- Python for backend processing and AI integration
- FastAPI for high-performance API endpoints
- FAISS for efficient vector similarity search
- OpenAI API for embeddings and LLM inference
- Next.js 15 with React 19 for the frontend
- Docker for containerization and deployment
- AWS EC2 for hosting
Challenges Solved
- Document Chunking: Implemented intelligent chunking that preserves context while respecting token limits
- Hybrid Search: Balanced semantic and lexical retrieval using RRF to handle both conceptual and keyword queries
- Memory Management: Built session-based memory that maintains conversation context without storing full histories
- Confidence Scoring: Developed a scoring mechanism that evaluates retrieval quality and answer relevance
Results
- 95%+ accuracy in document retrieval and question answering
- Time reduction: Document analysis time reduced from hours to minutes
- Scalable architecture: Handles large document collections with efficient vector search
- Production-ready: Containerized deployment enables easy setup and maintenance
What I Learned
Building Cortex taught me the importance of balancing semantic and lexical search strategies. The hybrid approach significantly outperformed either method alone. Additionally, implementing session-based memory required careful design to balance context retention with performance.
Next Steps
Future enhancements could include:
- Support for additional document formats (Markdown, CSV)
- Multi-tenant architecture for enterprise deployments
- Advanced caching strategies for faster response times
- Integration with more LLM providers for flexibility
Key Features & Capabilities
Intelligent Memory System for context-aware sessions and conversation history
Hybrid Retrieval combining Semantic Search, BM25 Lexical Search, and Reciprocal Rank Fusion
Confidence Scoring with source attribution and visual confidence badges
Multi-format document pipeline (PDF/DOCX/TXT) with session-based metadata
Modular LLM integration supporting multiple models and providers
Vector similarity search using FAISS for semantic document retrieval
Containerized deployment with Docker Compose and Nginx
Responsive web interface with modern UX patterns
