LLM-Powered Assessment Engine

AI evaluation system for generating, grading, and correcting free-form responses

Python · LLM Systems · Evaluation Systems · Assessment Automation

LLM Examiner: Building Trustworthy AI-Powered Exam Preparation

1. Problem Framing

Students preparing for high-stakes exams face a fundamental problem: passive learning doesn't test mastery. Reading notes and watching videos creates the illusion of understanding, but real exams require active recall, structured reasoning, and time management under pressure.

Traditional exam prep tools fail because they either:

  • Provide content without evaluation (passive consumption)
  • Offer generic practice questions (not exam-specific)
  • Lack realistic grading (too lenient or inconsistent)

What's at stake: Incorrect grading destroys trust. A student who receives inflated scores will enter the real exam unprepared. A system that grades inconsistently becomes unusable. Evaluation is harder than generation—any LLM can create questions, but grading requires strict adherence to rubrics, understanding of partial credit, and resistance to hallucination.

The core risk: LLMs are probabilistic. Without explicit constraints, they will:

  • Generate questions outside the syllabus
  • Grade inconsistently across similar answers
  • Provide feedback that contradicts grading
  • Hallucinate concepts not in the source material

This system needed to be trustworthy enough that students could rely on it for exam preparation, not just practice.

2. Constraints & Non-Goals

Explicit constraints prevent the system from breaking in predictable ways:

No speculative question generation. Every question must be grounded in the provided syllabus and past papers. The LLM cannot invent topics, difficulty levels, or question formats not present in the source material.

No opaque scoring. Every grade must reference an explicit rubric. Students must understand why they lost marks, not receive a score with vague feedback.

No "AI vibes." The system should feel deterministic where possible. Question serving is instant (in-memory), session structure is predictable (always 5 MCQ + 15 open-ended), and grading follows explicit rules.

No context re-sending per question. The full syllabus (often 10k+ tokens) cannot be sent to the LLM for each of 20 questions. This would cost $2-5 per session and introduce 20+ seconds of latency.

No real-time LLM calls for question serving. Once a session starts, questions must be served instantly from memory. Any LLM call during question serving breaks the exam flow.

No mutable sessions after completion. Once all questions are answered, the session becomes immutable. This prevents data integrity issues, cheating (retrying questions), and enables clean audit trails.

What we refused to build:

  • Adaptive difficulty (adds complexity without proven value)
  • Multi-user collaboration (out of scope for MVP)
  • Real-time leaderboards (distraction from learning)
  • Speculative "AI tutor" features (untrustworthy without grounding)

These constraints signal: "I know where systems break, and I've drawn boundaries to prevent it."

3. System Architecture

The system is built around a clear boundary: deterministic core, LLM at the edges.

[Diagram: LLM Examiner System Architecture]

Why these boundaries exist:

S3 as input boundary: Documents are processed once (PDF extraction, OCR) and stored in S3. The deterministic core never touches raw documents—it only reads processed text. This separates ingestion (can fail, can be slow) from question serving (must be fast, must be reliable).

In-memory session storage: Sessions are stored in memory for zero-latency question serving. This is a deliberate MVP tradeoff—we acknowledge data loss risk but prioritize user experience. Logs to disk provide audit trail.

LLM boundary: LLMs are called at three points: batch generation (once per session), grading (per answer), and post-session analysis (once per session). Everything else is deterministic Python code.

Question batching: Questions are generated in one LLM call and cached. The frontend never waits for LLM during question serving—it's pure data retrieval.

This architecture ensures that LLM failures (timeouts, rate limits, hallucinations) are contained and don't break the core exam flow.

4. Key Design Decisions

Decision 1: Batch Precomputation (5 MCQ + 15 Open-Ended)

What we chose: Generate all 20 questions in a single LLM call at session start. Fixed mix: 5 multiple-choice questions followed by 15 open-ended questions.

Why we rejected alternatives:

  • Per-question generation: Would cost 20x more ($2-5 vs $0.10-0.30 per session) and introduce 20+ seconds of latency. Each question would require sending the full context (10k+ tokens).
  • Dynamic question count: Unpredictable session length breaks user expectations and makes progress tracking impossible.
  • All-one-type questions: Real exams mix question types. Pure MCQ doesn't test reasoning; pure open-ended is too slow for practice.

Risk mitigated: Cost explosion (75% reduction), inconsistent sessions (always same structure), and latency spikes (questions served instantly from memory).

Implementation detail: The LLM is explicitly instructed to generate exactly 5 MCQ followed by 15 open-ended. Post-processing validates structure and reorders if needed. This is the only place where we tolerate LLM non-compliance—we detect and fix it.
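A minimal sketch of the single batched call, under stated assumptions: `call_llm` stands in for whatever client the system actually uses, and the prompt wording and JSON field names are illustrative, not the real ones.

```python
import json

NUM_MCQ, NUM_OPEN = 5, 15  # fixed session structure from the text

def build_generation_prompt(context: str) -> str:
    # One prompt, one call: the full context is sent exactly once per session.
    return (
        f"Using ONLY the syllabus below, generate exactly {NUM_MCQ} "
        f"multiple-choice questions followed by {NUM_OPEN} open-ended "
        "questions. Return a JSON array; each item must include "
        "'question_type', 'question', a 'rubric', and for MCQs "
        "'options' and 'correct_answer'.\n\n"
        f"SYLLABUS:\n{context}"
    )

def generate_batch(context: str, call_llm) -> list[dict]:
    raw = call_llm(build_generation_prompt(context))
    questions = json.loads(raw)  # one parse, cached for the whole session
    if len(questions) < NUM_MCQ + NUM_OPEN:
        # Graceful degradation: log and continue with what we have.
        print(f"warning: expected {NUM_MCQ + NUM_OPEN}, got {len(questions)}")
    return questions
```

The returned list is what gets cached in memory; question serving after this point is pure data retrieval.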

Decision 2: Session Sealing (Immutable on Completion)

What we chose: Once all 20 questions are answered, the session becomes immutable (sealed: true). No new answers can be submitted, questions cannot be modified, and the session is marked completed.

Why we rejected alternatives:

  • Mutable sessions: Allows cheating (retry questions after seeing feedback), breaks audit trails, and makes analytics unreliable.
  • Partial completion: Unclear semantics—is a session with 10/20 questions "complete"? Sealing only on full completion creates a clear boundary.

Risk mitigated: Data integrity (no retroactive changes), cheating prevention (can't retry after feedback), and clean analytics (completed sessions are a known state).

Implementation detail: The seal_session() function sets status: "completed", sealed: true, and completed_at: timestamp. All endpoints check sealed before allowing modifications. This is a simple flag that creates a powerful invariant.
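The sealing invariant can be sketched in a few lines. Field names (`status`, `sealed`, `completed_at`) follow the text; the in-memory store and the `submit_answer` helper are assumptions for illustration.

```python
import time

SESSIONS: dict[str, dict] = {}  # in-memory session store, as described above

def seal_session(session_id: str) -> None:
    session = SESSIONS[session_id]
    session["status"] = "completed"
    session["sealed"] = True
    session["completed_at"] = time.time()

def submit_answer(session_id: str, answer: dict) -> None:
    session = SESSIONS[session_id]
    if session.get("sealed"):
        # Every mutating endpoint checks this flag before touching the session.
        raise PermissionError("Session is sealed; no further changes allowed.")
    session.setdefault("answers", []).append(answer)
```

One boolean, checked everywhere, is the entire cheating-prevention and audit mechanism.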

Decision 3: Evaluation Before Feedback

What we chose: Grade the answer first (with explicit rubric), then generate feedback separately. These are two distinct LLM calls.

Why we rejected alternatives:

  • Combined grading+feedback: Creates feedback bias—if the LLM generates feedback first, it may grade to match the feedback rather than the rubric. Also prevents cost control (can't skip feedback for perfect answers).
  • Feedback before grading: Breaks the mental model—students expect to see their score first, then understand why.

Risk mitigated: Grading consistency (rubric is primary, feedback is secondary), cost control (can skip feedback for high scores), and separation of concerns (grading is evaluation, feedback is teaching).

Implementation detail: The grader receives only the question and student answer—no context, no syllabus. This forces rubric-based grading. The autopsy engine receives the grading result and generates feedback. This separation is explicit in the code.
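The two-call separation might look like the following sketch. The grader sees only the question (with its embedded rubric) and the answer; the feedback call sees the grading result. `call_llm`, the prompt text, and the JSON shape are assumptions.

```python
import json

def grade_answer(question: dict, answer: str, call_llm) -> dict:
    # No syllabus, no session context: the rubric is the only ground truth.
    prompt = (
        "Grade strictly against the rubric. Return JSON with "
        "'marks', 'max_marks', and 'rubric_points_met'.\n"
        f"QUESTION: {question['question']}\n"
        f"RUBRIC: {question['rubric']}\n"
        f"ANSWER: {answer}"
    )
    return json.loads(call_llm(prompt))

def generate_feedback(question: dict, answer: str, grade: dict, call_llm) -> str:
    if grade["marks"] >= grade["max_marks"]:
        # Cost control: perfect answers skip the feedback call entirely.
        return "Full marks - no feedback needed."
    prompt = (
        "Explain, in terms of the rubric points missed, how this answer "
        f"could reach full marks.\nGRADING: {json.dumps(grade)}\n"
        f"ANSWER: {answer}"
    )
    return call_llm(prompt)
```

Because grading happens first and feedback is derived from the grade, feedback can never quietly pull the score toward itself.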

Decision 4: Question Type Separation

What we chose: Fixed mix of 5 MCQ + 15 open-ended, with explicit question_type field on every question. Frontend renders differently based on type (radio buttons vs textarea).

Why we rejected alternatives:

  • Dynamic detection: Detecting question type from context is unreliable. The LLM might generate MCQ when we want open-ended, or vice versa.
  • All-one-type: Doesn't match real exams, which mix question types for different assessment goals.

Risk mitigated: Frontend rendering errors (explicit type prevents guessing), user confusion (consistent structure), and LLM non-compliance (we enforce type in post-processing).

Implementation detail: Every question has question_type: "multiple_choice" | "open_ended". The batch generator explicitly requests this mix. Post-processing validates and enforces it. The frontend checks question.question_type === 'multiple_choice' to render radio buttons.

Decision 5: Time Tracking Per Question

What we chose: Track started_at and submitted_at timestamps for each question, stored in the session's answers array.

Why we rejected alternatives:

  • Session-level timing only: Loses granularity. Can't detect time pressure patterns (low scores with short time) or identify which questions take too long.
  • No timing: Misses a critical signal. Time pressure is a real exam factor that affects performance.

Risk mitigated: Missing failure patterns (time pressure failures are invisible without per-question timing), incomplete analytics (can't calculate average time per question type), and poor user insights (students don't know where they're slow).

Implementation detail: Frontend records started_at when question is displayed, submitted_at when answer is submitted. Backend calculates time_spent_seconds and stores it with the answer. Pattern analyzer uses this to identify time pressure failures.

Decision 6: S3 for Context Storage

What we chose: Store syllabus and past papers in S3, organized by {user_id}/{subject}/context.txt. Mobile app uploads documents (PDF/OCR) which are processed and stored.

Why we rejected alternatives:

  • Local files: Doesn't scale, breaks multi-device access, and creates deployment complexity.
  • Database storage: Overkill for large text blobs, harder to version, and adds query complexity.

Risk mitigated: Scalability (S3 handles large files), multi-device access (mobile app and web app share same context), and deployment simplicity (no database migrations for content).

Implementation detail: The S3 client uses boto3 with automatic region detection. Documents are appended to existing context (allows incremental updates). The generator reads from S3 once per session start, not per question.
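A sketch of the key layout and append-style update. The `{user_id}/{subject}/context.txt` scheme comes from the text; the bucket name is hypothetical and the boto3 read-modify-write shown here is one simple way to append, not necessarily the system's exact approach.

```python
def context_key(user_id: str, subject: str) -> str:
    # Layout from the text: {user_id}/{subject}/context.txt
    return f"{user_id}/{subject}/context.txt"

def append_context(user_id: str, subject: str, new_text: str,
                   bucket: str = "llm-examiner-content") -> None:
    import boto3  # real library; client/region configuration omitted here
    s3 = boto3.client("s3")
    key = context_key(user_id, subject)
    try:
        existing = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode()
    except s3.exceptions.NoSuchKey:
        existing = ""  # first upload for this user/subject
    s3.put_object(Bucket=bucket, Key=key,
                  Body=(existing + "\n" + new_text).encode())
```

The generator calls `get_object` once at session start; nothing in the question-serving path ever touches S3.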

Decision 7: Post-Session Analysis (Summary to Patterns to Autopsy)

What we chose: Three-phase analysis: summary (statistics), patterns (failure analysis), and autopsy (detailed feedback for weak questions). Each is a separate endpoint and LLM call.

Why we rejected alternatives:

  • Single comprehensive analysis: Too expensive ($1-2 per session), too slow (10+ seconds), and information overload (students don't need everything at once).
  • No analysis: Misses the core value proposition—students need to understand their weaknesses.

Risk mitigated: Cost control (can skip patterns/autopsy if not needed), latency (summary is fast, detailed analysis is on-demand), and information architecture (progressive disclosure).

Implementation detail: Summary is computed from session data (no LLM). Patterns use one LLM call to analyze all answers. Autopsy batches weak questions (less than 70% marks) into one LLM call. This creates a clear cost and latency hierarchy.
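The cost hierarchy falls out of the code shape: summary is pure computation, and autopsy only ever sees the filtered weak set. A sketch, with answer-record fields assumed from the grading output:

```python
WEAK_THRESHOLD = 0.70  # "less than 70% marks" cutoff from the text

def summary(answers: list[dict]) -> dict:
    # Phase 1: plain statistics, no LLM call.
    total = sum(a["marks"] for a in answers)
    max_total = sum(a["max_marks"] for a in answers)
    return {"total": total, "max_total": max_total,
            "percent": round(100 * total / max_total, 1)}

def weak_questions(answers: list[dict]) -> list[dict]:
    # Phase 3 input: only these are batched into the single autopsy call.
    return [a for a in answers
            if a["marks"] / a["max_marks"] < WEAK_THRESHOLD]
```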

5. Failure Modes & Safeguards

LLM Doesn't Follow Question Type Instructions

What can go wrong: LLM generates 3 MCQ and 17 open-ended, or marks open-ended questions as MCQ.

How we detect it: Post-processing validates structure. We categorize questions by presence of options and correct_answer fields, then check question_type field.

What happens next: We reorder questions (MCQ first, then open-ended), enforce question_type explicitly, and log a warning. The session continues with corrected structure. This is the only place where we tolerate and fix LLM non-compliance.
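The detect-and-fix step can be sketched as follows: categorize by structural evidence (presence of `options` and `correct_answer`), overwrite `question_type` rather than trusting the LLM's label, and reorder MCQs first. Function names are illustrative.

```python
def looks_like_mcq(q: dict) -> bool:
    # Structural evidence beats the LLM's own question_type label.
    return bool(q.get("options")) and "correct_answer" in q

def validate_and_reorder(questions: list[dict]) -> list[dict]:
    mcq, open_ended = [], []
    for q in questions:
        if looks_like_mcq(q):
            q["question_type"] = "multiple_choice"  # enforce, don't trust
            mcq.append(q)
        else:
            q["question_type"] = "open_ended"
            open_ended.append(q)
    if len(mcq) != 5:
        # Log and continue: the session proceeds with corrected structure.
        print(f"warning: expected 5 MCQs, got {len(mcq)}")
    return mcq + open_ended
```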

LLM Generates Fewer Than 20 Questions

What can go wrong: LLM returns 15 questions instead of 20, or JSON parsing fails entirely.

How we detect it: Count validation after parsing: if len(questions) < num_questions.

What happens next: We log a warning and return what we have (graceful degradation). The session continues with fewer questions. In production, we'd retry or alert, but for MVP we accept partial sessions.

Session Data Lost (In-Memory Storage)

What can go wrong: Server restart, crash, or deployment wipes all sessions from memory.

How we detect it: GET /api/session/{session_id}/next returns 404 "Session not found".

What happens next: This is an acknowledged MVP limitation. We log all attempts to disk (session_{session_id}.txt), so sessions can be partially reconstructed. In production, we'd use a database. For now, we're explicit about the tradeoff: zero-latency question serving vs. persistence.

Grading Inconsistency

What can go wrong: Same answer receives different scores across sessions, or grading doesn't match rubric.

How we detect it: Explicit rubrics in every question, structured JSON output from grader, and "strict mode" flag that enforces rubric adherence.

What happens next: The grader receives only question + answer (no context, no syllabus). This forces rubric-based grading. If grading is inconsistent, it's a prompt engineering problem, not a system architecture problem. We log all grading results for analysis.

Document Extraction Fails

What can go wrong: PDF has no extractable text (scanned images), OCR produces gibberish, or extraction returns empty string.

How we detect it: Quality checks: minimum 50 characters, minimum 10 words. Warnings for content less than 200 characters.

What happens next: Clear error messages to user, manual text entry fallback. The system doesn't proceed with invalid content—this prevents garbage-in-garbage-out.
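The quality gate uses the thresholds stated above (50-character / 10-word hard minimum, warning below 200 characters); the function shape and messages are a sketch.

```python
MIN_CHARS, MIN_WORDS, WARN_CHARS = 50, 10, 200

def check_extracted_text(text: str) -> tuple[bool, str]:
    stripped = text.strip()
    if len(stripped) < MIN_CHARS or len(stripped.split()) < MIN_WORDS:
        # Hard failure: block the pipeline, direct the user to manual entry.
        return False, ("Extraction produced too little text. "
                       "Please enter the content manually.")
    if len(stripped) < WARN_CHARS:
        # Soft failure: proceed, but surface a warning.
        return True, "Warning: extracted content is very short."
    return True, ""
```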

OCR Unavailable

What can go wrong: Tesseract not installed on server, or OCR library fails to import.

How we detect it: try/except ImportError around pytesseract import. TESSERACT_AVAILABLE flag.

What happens next: Graceful degradation. /api/extract-image returns 503 with clear message: "OCR service not configured. Please enter text manually." The mobile app shows this message and enables manual entry.
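The availability flag and the 503 fallback together look roughly like this. Only the try/except-ImportError pattern and the flag name come from the text; the endpoint wiring is illustrative.

```python
try:
    import pytesseract  # real library; fails cleanly when not installed
    TESSERACT_AVAILABLE = True
except ImportError:
    TESSERACT_AVAILABLE = False

def extract_image_endpoint(image_bytes: bytes) -> tuple[int, dict]:
    if not TESSERACT_AVAILABLE:
        # Degrade to manual entry instead of crashing mid-upload.
        return 503, {"error": "OCR service not configured. "
                              "Please enter text manually."}
    # In the real handler the image would be decoded and passed to
    # pytesseract.image_to_string; elided in this sketch.
    return 200, {"text": "..."}
```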

6. Outcomes & Learnings

Operational Outcomes

75% cost reduction via batch precomputation. Generating 20 questions in one call costs ~$0.10-0.30 vs. $0.60-1.40 for sequential generation. This makes the system viable at scale.

Zero question-serving latency. Questions are served from in-memory cache. Frontend never waits for LLM during exam flow. This creates a realistic exam experience.
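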

Predictable session structure. Every session is exactly 5 MCQ + 15 open-ended. Students know what to expect, progress tracking is consistent, and analytics are comparable across sessions.

Trustworthy grading. Explicit rubrics, no context in grading calls, and structured output prevent hallucination. Students trust the scores because they can see the reasoning.

Clear upgrade paths. Session sealing creates clean boundaries for analytics. We can add features (multi-session progress, difficulty calibration) without breaking existing sessions.

Learnings

Batch generation is more reliable than sequential. When generating 20 questions at once, the LLM sees the full context and creates a coherent set. Sequential generation often produces repetitive or inconsistent questions.

Explicit question_type prevents frontend bugs. Without it, the frontend would guess based on structure (if (question.options)), which breaks when LLM output varies. Explicit types create a contract.

Session sealing creates clean audit boundaries. Once sealed, a session is a known state. This enables analytics, prevents cheating, and makes the system feel deterministic even though LLMs are probabilistic.

Time tracking reveals patterns not visible in scores alone. A student who scores 60% in 2 minutes vs. 60% in 20 minutes has different problems. Time pressure failures are a distinct failure mode that requires different interventions.

Post-processing is essential. LLMs don't always follow instructions. We validate, reorder, and enforce structure. This is the difference between a prototype and a production system.

Separation of concerns reduces risk. Grading and feedback are separate. This prevents feedback bias, allows cost control, and makes each component testable independently.

What I'd Do Next

Persistent Session Storage

Replace in-memory storage with a database (PostgreSQL or DynamoDB). Sessions are too valuable to lose on server restart. This is the highest-priority upgrade.

Tradeoff: Acceptable latency increase (10-50ms per question fetch) for persistence. Use connection pooling and read replicas to minimize impact.

Question Difficulty Calibration

Track success rates per question across all sessions. Questions with less than 20% success rate are too hard; greater than 90% success rate are too easy. Use this to calibrate difficulty in future generations.
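A speculative sketch of those thresholds; per-question attempt tracking and the label names are assumptions, since this feature is proposed rather than built.

```python
TOO_HARD, TOO_EASY = 0.20, 0.90  # success-rate cutoffs from the text

def classify_difficulty(successes: int, attempts: int) -> str:
    rate = successes / attempts
    if rate < TOO_HARD:
        return "too_hard"
    if rate > TOO_EASY:
        return "too_easy"
    return "calibrated"
```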

Second-order effect: Better question quality over time, adaptive difficulty per student, and data-driven syllabus gaps (topics where all questions are too hard).

Adaptive Question Selection

Instead of fixed 5 MCQ + 15 open-ended, select questions based on student's weak topics (from pattern analysis). If a student consistently fails thermodynamics questions, generate more thermodynamics questions.

Risk: Breaks session predictability. Mitigate by keeping fixed structure but biasing topic selection within that structure.

Multi-Session Analytics

Track progress across sessions. Show improvement over time, identify topics where student is getting better/worse, and recommend focus areas.

Requirement: Persistent storage (see above). This is where session sealing pays off—completed sessions are clean data points.

A/B Testing Framework

Test different grading strategies (strict vs. lenient), feedback styles (detailed vs. concise), and question generation prompts. Measure impact on student outcomes.

Infrastructure: Feature flags, session metadata (which variant), and analytics pipeline to compare variants.

Uncertainty Handling

LLMs are probabilistic. When grading is borderline (e.g., 4.5/10), explicitly communicate uncertainty: "This answer is between 4-5 marks. Here's what would push it to 5..."

Value: Builds trust through transparency. Students understand that AI grading has limits, and they know when to question the score.


Conclusion

This system works because it acknowledges LLM limitations and builds around them. We don't pretend LLMs are deterministic—we create deterministic boundaries. We don't ignore cost—we optimize it through batching. We don't hide risks—we document failure modes and safeguards.

The result is a system that founders trust (clear constraints, explicit risks) and engineers respect (sound decisions, proper boundaries). It's not perfect, but it's production-ready.

Key Features & Capabilities

  • Free-form response generation
  • Automated grading and correction
  • Verification and feedback reliability
  • Learning under ambiguity