LLM-Powered Assessment Engine

AI evaluation system for generating, grading, and correcting free-form responses

Python · LLM Systems · Evaluation Systems · Assessment Automation

LLM Examiner: Building Trustworthy AI-Powered Exam Preparation

1. Problem Framing

Students preparing for high-stakes exams face a fundamental problem: passive learning doesn't test mastery. Reading notes and watching videos creates the illusion of understanding, but real exams require active recall, structured reasoning, and time management under pressure.

Traditional exam prep tools fail because they either:

  • Provide content without evaluation (passive consumption)
  • Offer generic practice questions (not exam-specific)
  • Lack realistic grading (too lenient or inconsistent)

What's at stake: Incorrect grading destroys trust. A student who receives inflated scores will enter the real exam unprepared. A system that grades inconsistently becomes unusable. Evaluation is harder than generation—any LLM can create questions, but grading requires strict adherence to rubrics, understanding of partial credit, and resistance to hallucination.

The core risk: LLMs are probabilistic. Without explicit constraints, they will:

  • Generate questions outside the syllabus
  • Grade inconsistently across similar answers
  • Provide feedback that contradicts grading
  • Hallucinate concepts not in the source material

This system needed to be trustworthy enough that students could rely on it for exam preparation, not just practice.

2. Constraints & Non-Goals

Explicit constraints prevent the system from breaking in predictable ways:

No speculative question generation. Every question must be grounded in the provided syllabus and past papers. The LLM cannot invent topics, difficulty levels, or question formats not present in the source material.

No opaque scoring. Every grade must reference an explicit rubric. Students must understand why they lost marks, not receive a score with vague feedback.

No "AI vibes." The system should feel deterministic where possible. Question serving is instant (in-memory), session structure is predictable (always 5 MCQ + 15 open-ended), and grading follows explicit rules.

No context re-sending per question. The full syllabus (often 10k+ tokens) cannot be sent to the LLM for each of 20 questions. This would cost $2-5 per session and introduce 20+ seconds of latency.

No real-time LLM calls for question serving. Once a session starts, questions must be served instantly from memory. Any LLM call during question serving breaks the exam flow.

No mutable sessions after completion. Once all questions are answered, the session becomes immutable. This prevents data integrity issues, cheating (retrying questions), and enables clean audit trails.

What we refused to build:

  • Adaptive difficulty (adds complexity without proven value)
  • Multi-user collaboration (out of scope for MVP)
  • Real-time leaderboards (distraction from learning)
  • Speculative "AI tutor" features (untrustworthy without grounding)

These constraints signal: "I know where systems break, and I've drawn boundaries to prevent it."

3. System Architecture

The system is built around a clear boundary: deterministic core, LLM at the edges.

[Diagram: LLM Examiner System Architecture]

Why these boundaries exist:

S3 as input boundary: Documents are processed once (PDF extraction, OCR) and stored in S3. The deterministic core never touches raw documents—it only reads processed text. This separates ingestion (can fail, can be slow) from question serving (must be fast, must be reliable).

In-memory session storage: Sessions are stored in memory for zero-latency question serving. This is a deliberate MVP tradeoff—we acknowledge data loss risk but prioritize user experience. Logs to disk provide audit trail.

LLM boundary: LLMs are called at three points: batch generation (once per session), grading (per answer), and post-session analysis (once per session). Everything else is deterministic Python code.

Question batching: Questions are generated in one LLM call and cached. The frontend never waits for LLM during question serving—it's pure data retrieval.

This architecture ensures that LLM failures (timeouts, rate limits, hallucinations) are contained and don't break the core exam flow.

4. Key Design Decisions

Decision 1: Batch Precomputation (5 MCQ + 15 Open-Ended)

What we chose: Generate all 20 questions in a single LLM call at session start. Fixed mix: 5 multiple-choice questions followed by 15 open-ended questions.

Why we rejected alternatives:

  • Per-question generation: Would cost 20x more ($2-5 vs $0.10-0.30 per session) and introduce 20+ seconds of latency. Each question would require sending the full context (10k+ tokens).
  • Dynamic question count: Unpredictable session length breaks user expectations and makes progress tracking impossible.
  • All-one-type questions: Real exams mix question types. Pure MCQ doesn't test reasoning; pure open-ended is too slow for practice.

Risk mitigated: Cost explosion (75% reduction), inconsistent sessions (always same structure), and latency spikes (questions served instantly from memory).

Implementation detail: The LLM is explicitly instructed to generate exactly 5 MCQ followed by 15 open-ended. Post-processing validates structure and reorders if needed. This is the only place where we tolerate LLM non-compliance—we detect and fix it.
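A minimal sketch of the single batched call, under stated assumptions: `call_llm` stands in for whatever client the system actually uses, and the prompt wording and JSON field names are illustrative, not the real ones.

```python
import json

NUM_MCQ, NUM_OPEN = 5, 15  # fixed session structure from the text

def build_generation_prompt(context: str) -> str:
    # One prompt, one call: the full context is sent exactly once per session.
    return (
        f"Using ONLY the syllabus below, generate exactly {NUM_MCQ} "
        f"multiple-choice questions followed by {NUM_OPEN} open-ended "
        "questions. Return a JSON array; each item must include "
        "'question_type', 'question', a 'rubric', and for MCQs "
        "'options' and 'correct_answer'.\n\n"
        f"SYLLABUS:\n{context}"
    )

def generate_batch(context: str, call_llm) -> list[dict]:
    raw = call_llm(build_generation_prompt(context))
    questions = json.loads(raw)  # one parse, cached for the whole session
    if len(questions) < NUM_MCQ + NUM_OPEN:
        # Graceful degradation: log and continue with what we have.
        print(f"warning: expected {NUM_MCQ + NUM_OPEN}, got {len(questions)}")
    return questions
```

The returned list is what gets cached in memory; question serving after this point is pure data retrieval.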

Decision 2: Session Sealing (Immutable on Completion)

What we chose: Once all 20 questions are answered, the session becomes immutable (sealed: true). No new answers can be submitted, questions cannot be modified, and the session is marked completed.

Why we rejected alternatives:

  • Mutable sessions: Allows cheating (retry questions after seeing feedback), breaks audit trails, and makes analytics unreliable.
  • Partial completion: Unclear semantics—is a session with 10/20 questions "complete"? Sealing only on full completion creates a clear boundary.

Risk mitigated: Data integrity (no retroactive changes), cheating prevention (can't retry after feedback), and clean analytics (completed sessions are a known state).

Implementation detail: The seal_session() function sets status: "completed", sealed: true, and completed_at: timestamp. All endpoints check sealed before allowing modifications. This is a simple flag that creates a powerful invariant.
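The sealing invariant can be sketched in a few lines. Field names (`status`, `sealed`, `completed_at`) follow the text; the in-memory store and the `submit_answer` helper are assumptions for illustration.

```python
import time

SESSIONS: dict[str, dict] = {}  # in-memory session store, as described above

def seal_session(session_id: str) -> None:
    session = SESSIONS[session_id]
    session["status"] = "completed"
    session["sealed"] = True
    session["completed_at"] = time.time()

def submit_answer(session_id: str, answer: dict) -> None:
    session = SESSIONS[session_id]
    if session.get("sealed"):
        # Every mutating endpoint checks this flag before touching the session.
        raise PermissionError("Session is sealed; no further changes allowed.")
    session.setdefault("answers", []).append(answer)
```

One boolean, checked everywhere, is the entire cheating-prevention and audit mechanism.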

Decision 3: Evaluation Before Feedback

What we chose: Grade the answer first (with explicit rubric), then generate feedback separately. These are two distinct LLM calls.

Why we rejected alternatives:

  • Combined grading+feedback: Creates feedback bias—if the LLM generates feedback first, it may grade to match the feedback rather than the rubric. Also prevents cost control (can't skip feedback for perfect answers).
  • Feedback before grading: Breaks the mental model—students expect to see their score first, then understand why.

Risk mitigated: Grading consistency (rubric is primary, feedback is secondary), cost control (can skip feedback for high scores), and separation of concerns (grading is evaluation, feedback is teaching).

Implementation detail: The grader receives only the question and student answer—no context, no syllabus. This forces rubric-based grading. The autopsy engine receives the grading result and generates feedback. This separation is explicit in the code.
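The two-call separation might look like the following sketch. The grader sees only the question (with its embedded rubric) and the answer; the feedback call sees the grading result. `call_llm`, the prompt text, and the JSON shape are assumptions.

```python
import json

def grade_answer(question: dict, answer: str, call_llm) -> dict:
    # No syllabus, no session context: the rubric is the only ground truth.
    prompt = (
        "Grade strictly against the rubric. Return JSON with "
        "'marks', 'max_marks', and 'rubric_points_met'.\n"
        f"QUESTION: {question['question']}\n"
        f"RUBRIC: {question['rubric']}\n"
        f"ANSWER: {answer}"
    )
    return json.loads(call_llm(prompt))

def generate_feedback(question: dict, answer: str, grade: dict, call_llm) -> str:
    if grade["marks"] >= grade["max_marks"]:
        # Cost control: perfect answers skip the feedback call entirely.
        return "Full marks - no feedback needed."
    prompt = (
        "Explain, in terms of the rubric points missed, how this answer "
        f"could reach full marks.\nGRADING: {json.dumps(grade)}\n"
        f"ANSWER: {answer}"
    )
    return call_llm(prompt)
```

Because grading happens first and feedback is derived from the grade, feedback can never quietly pull the score toward itself.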

Decision 4: Question Type Separation

What we chose: Fixed mix of 5 MCQ + 15 open-ended, with explicit question_type field on every question. Frontend renders differently based on type (radio buttons vs textarea).

Why we rejected alternatives:

  • Dynamic detection: Detecting question type from context is unreliable. The LLM might generate MCQ when we want open-ended, or vice versa.
  • All-one-type: Doesn't match real exams, which mix question types for different assessment goals.

Risk mitigated: Frontend rendering errors (explicit type prevents guessing), user confusion (consistent structure), and LLM non-compliance (we enforce type in post-processing).

Implementation detail: Every question has question_type: "multiple_choice" | "open_ended". The batch generator explicitly requests this mix. Post-processing validates and enforces it. The frontend checks question.question_type === 'multiple_choice' to render radio buttons.

Decision 5: Time Tracking Per Question

What we chose: Track started_at and submitted_at timestamps for each question, stored in the session's answers array.

Why we rejected alternatives:

  • Session-level timing only: Loses granularity. Can't detect time pressure patterns (low scores with short time) or identify which questions take too long.
  • No timing: Misses a critical signal. Time pressure is a real exam factor that affects performance.

Risk mitigated: Missing failure patterns (time pressure failures are invisible without per-question timing), incomplete analytics (can't calculate average time per question type), and poor user insights (students don't know where they're slow).

Implementation detail: Frontend records started_at when question is displayed, submitted_at when answer is submitted. Backend calculates time_spent_seconds and stores it with the answer. Pattern analyzer uses this to identify time pressure failures.

Decision 6: S3 for Context Storage

What we chose: Store syllabus and past papers in S3, organized by {user_id}/{subject}/context.txt. Mobile app uploads documents (PDF/OCR) which are processed and stored.

Why we rejected alternatives:

  • Local files: Doesn't scale, breaks multi-device access, and creates deployment complexity.
  • Database storage: Overkill for large text blobs, harder to version, and adds query complexity.

Risk mitigated: Scalability (S3 handles large files), multi-device access (mobile app and web app share same context), and deployment simplicity (no database migrations for content).

Implementation detail: The S3 client uses boto3 with automatic region detection. Documents are appended to existing context (allows incremental updates). The generator reads from S3 once per session start, not per question.
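A sketch of the key layout and append-style update. The `{user_id}/{subject}/context.txt` scheme comes from the text; the bucket name is hypothetical and the boto3 read-modify-write shown here is one simple way to append, not necessarily the system's exact approach.

```python
def context_key(user_id: str, subject: str) -> str:
    # Layout from the text: {user_id}/{subject}/context.txt
    return f"{user_id}/{subject}/context.txt"

def append_context(user_id: str, subject: str, new_text: str,
                   bucket: str = "llm-examiner-content") -> None:
    import boto3  # real library; client/region configuration omitted here
    s3 = boto3.client("s3")
    key = context_key(user_id, subject)
    try:
        existing = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode()
    except s3.exceptions.NoSuchKey:
        existing = ""  # first upload for this user/subject
    s3.put_object(Bucket=bucket, Key=key,
                  Body=(existing + "\n" + new_text).encode())
```

The generator calls `get_object` once at session start; nothing in the question-serving path ever touches S3.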

Decision 7: Post-Session Analysis (Summary to Patterns to Autopsy)

What we chose: Three-phase analysis: summary (statistics), patterns (failure analysis), and autopsy (detailed feedback for weak questions). Each is a separate endpoint and LLM call.

Why we rejected alternatives:

  • Single comprehensive analysis: Too expensive ($1-2 per session), too slow (10+ seconds), and information overload (students don't need everything at once).
  • No analysis: Misses the core value proposition—students need to understand their weaknesses.

Risk mitigated: Cost control (can skip patterns/autopsy if not needed), latency (summary is fast, detailed analysis is on-demand), and information architecture (progressive disclosure).

Implementation detail: Summary is computed from session data (no LLM). Patterns use one LLM call to analyze all answers. Autopsy batches weak questions (less than 70% marks) into one LLM call. This creates a clear cost and latency hierarchy.
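The cost hierarchy falls out of the code shape: summary is pure computation, and autopsy only ever sees the filtered weak set. A sketch, with answer-record fields assumed from the grading output:

```python
WEAK_THRESHOLD = 0.70  # "less than 70% marks" cutoff from the text

def summary(answers: list[dict]) -> dict:
    # Phase 1: plain statistics, no LLM call.
    total = sum(a["marks"] for a in answers)
    max_total = sum(a["max_marks"] for a in answers)
    return {"total": total, "max_total": max_total,
            "percent": round(100 * total / max_total, 1)}

def weak_questions(answers: list[dict]) -> list[dict]:
    # Phase 3 input: only these are batched into the single autopsy call.
    return [a for a in answers
            if a["marks"] / a["max_marks"] < WEAK_THRESHOLD]
```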

5. Failure Modes & Safeguards

LLM Doesn't Follow Question Type Instructions

What can go wrong: LLM generates 3 MCQ and 17 open-ended, or marks open-ended questions as MCQ.

How we detect it: Post-processing validates structure. We categorize questions by presence of options and correct_answer fields, then check question_type field.

What happens next: We reorder questions (MCQ first, then open-ended), enforce question_type explicitly, and log a warning. The session continues with corrected structure. This is the only place where we tolerate and fix LLM non-compliance.
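The detect-and-fix step can be sketched as follows: categorize by structural evidence (presence of `options` and `correct_answer`), overwrite `question_type` rather than trusting the LLM's label, and reorder MCQs first. Function names are illustrative.

```python
def looks_like_mcq(q: dict) -> bool:
    # Structural evidence beats the LLM's own question_type label.
    return bool(q.get("options")) and "correct_answer" in q

def validate_and_reorder(questions: list[dict]) -> list[dict]:
    mcq, open_ended = [], []
    for q in questions:
        if looks_like_mcq(q):
            q["question_type"] = "multiple_choice"  # enforce, don't trust
            mcq.append(q)
        else:
            q["question_type"] = "open_ended"
            open_ended.append(q)
    if len(mcq) != 5:
        # Log and continue: the session proceeds with corrected structure.
        print(f"warning: expected 5 MCQs, got {len(mcq)}")
    return mcq + open_ended
```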

LLM Generates Fewer Than 20 Questions

What can go wrong: LLM returns 15 questions instead of 20, or JSON parsing fails entirely.

How we detect it: Count validation after parsing: if len(questions) < num_questions.

What happens next: We log a warning and return what we have (graceful degradation). The session continues with fewer questions. In production, we'd retry or alert, but for MVP we accept partial sessions.

Session Data Lost (In-Memory Storage)

What can go wrong: Server restart, crash, or deployment wipes all sessions from memory.

How we detect it: GET /api/session/{session_id}/next returns 404 "Session not found".

What happens next: This is an acknowledged MVP limitation. We log all attempts to disk (session_{session_id}.txt), so sessions can be partially reconstructed. In production, we'd use a database. For now, we're explicit about the tradeoff: zero-latency question serving vs. persistence.

Grading Inconsistency

What can go wrong: Same answer receives different scores across sessions, or grading doesn't match rubric.

How we detect it: Explicit rubrics in every question, structured JSON output from grader, and "strict mode" flag that enforces rubric adherence.

What happens next: The grader receives only question + answer (no context, no syllabus). This forces rubric-based grading. If grading is inconsistent, it's a prompt engineering problem, not a system architecture problem. We log all grading results for analysis.

Document Extraction Fails

What can go wrong: PDF has no extractable text (scanned images), OCR produces gibberish, or extraction returns empty string.

How we detect it: Quality checks: minimum 50 characters, minimum 10 words. Warnings for content less than 200 characters.

What happens next: Clear error messages to user, manual text entry fallback. The system doesn't proceed with invalid content—this prevents garbage-in-garbage-out.
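The quality gate uses the thresholds stated above (50-character / 10-word hard minimum, warning below 200 characters); the function shape and messages are a sketch.

```python
MIN_CHARS, MIN_WORDS, WARN_CHARS = 50, 10, 200

def check_extracted_text(text: str) -> tuple[bool, str]:
    stripped = text.strip()
    if len(stripped) < MIN_CHARS or len(stripped.split()) < MIN_WORDS:
        # Hard failure: block the pipeline, direct the user to manual entry.
        return False, ("Extraction produced too little text. "
                       "Please enter the content manually.")
    if len(stripped) < WARN_CHARS:
        # Soft failure: proceed, but surface a warning.
        return True, "Warning: extracted content is very short."
    return True, ""
```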

OCR Unavailable

What can go wrong: Tesseract not installed on server, or OCR library fails to import.

How we detect it: try/except ImportError around pytesseract import. TESSERACT_AVAILABLE flag.

What happens next: Graceful degradation. /api/extract-image returns 503 with clear message: "OCR service not configured. Please enter text manually." The mobile app shows this message and enables manual entry.
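The availability flag and the 503 fallback together look roughly like this. Only the try/except-ImportError pattern and the flag name come from the text; the endpoint wiring is illustrative.

```python
try:
    import pytesseract  # real library; fails cleanly when not installed
    TESSERACT_AVAILABLE = True
except ImportError:
    TESSERACT_AVAILABLE = False

def extract_image_endpoint(image_bytes: bytes) -> tuple[int, dict]:
    if not TESSERACT_AVAILABLE:
        # Degrade to manual entry instead of crashing mid-upload.
        return 503, {"error": "OCR service not configured. "
                              "Please enter text manually."}
    # In the real handler the image would be decoded and passed to
    # pytesseract.image_to_string; elided in this sketch.
    return 200, {"text": "..."}
```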

6. Outcomes & Learnings

Operational Outcomes

75% cost reduction via batch precomputation. Generating 20 questions in one call costs ~$0.10-0.30 vs. $0.60-1.40 for sequential generation. This makes the system viable at scale.

Zero question-serving latency. Questions are served from in-memory cache. Frontend never waits for LLM during exam flow. This creates a realistic exam experience.
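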

Predictable session structure. Every session is exactly 5 MCQ + 15 open-ended. Students know what to expect, progress tracking is consistent, and analytics are comparable across sessions.

Trustworthy grading. Explicit rubrics, no context in grading calls, and structured output prevent hallucination. Students trust the scores because they can see the reasoning.

Clear upgrade paths. Session sealing creates clean boundaries for analytics. We can add features (multi-session progress, difficulty calibration) without breaking existing sessions.

Learnings

Batch generation is more reliable than sequential. When generating 20 questions at once, the LLM sees the full context and creates a coherent set. Sequential generation often produces repetitive or inconsistent questions.

Explicit question_type prevents frontend bugs. Without it, the frontend would guess based on structure (if (question.options)), which breaks when LLM output varies. Explicit types create a contract.

Session sealing creates clean audit boundaries. Once sealed, a session is a known state. This enables analytics, prevents cheating, and makes the system feel deterministic even though LLMs are probabilistic.

Time tracking reveals patterns not visible in scores alone. A student who scores 60% in 2 minutes vs. 60% in 20 minutes has different problems. Time pressure failures are a distinct failure mode that requires different interventions.

Post-processing is essential. LLMs don't always follow instructions. We validate, reorder, and enforce structure. This is the difference between a prototype and a production system.

Separation of concerns reduces risk. Grading and feedback are separate. This prevents feedback bias, allows cost control, and makes each component testable independently.

What I'd Do Next

Persistent Session Storage

Replace in-memory storage with a database (PostgreSQL or DynamoDB). Sessions are too valuable to lose on server restart. This is the highest-priority upgrade.

Tradeoff: Acceptable latency increase (10-50ms per question fetch) for persistence. Use connection pooling and read replicas to minimize impact.

Question Difficulty Calibration

Track success rates per question across all sessions. Questions with less than 20% success rate are too hard; greater than 90% success rate are too easy. Use this to calibrate difficulty in future generations.
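A speculative sketch of those thresholds; per-question attempt tracking and the label names are assumptions, since this feature is proposed rather than built.

```python
TOO_HARD, TOO_EASY = 0.20, 0.90  # success-rate cutoffs from the text

def classify_difficulty(successes: int, attempts: int) -> str:
    rate = successes / attempts
    if rate < TOO_HARD:
        return "too_hard"
    if rate > TOO_EASY:
        return "too_easy"
    return "calibrated"
```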

Second-order effect: Better question quality over time, adaptive difficulty per student, and data-driven syllabus gaps (topics where all questions are too hard).

Adaptive Question Selection

Instead of fixed 5 MCQ + 15 open-ended, select questions based on student's weak topics (from pattern analysis). If a student consistently fails thermodynamics questions, generate more thermodynamics questions.

Risk: Breaks session predictability. Mitigate by keeping fixed structure but biasing topic selection within that structure.

Multi-Session Analytics

Track progress across sessions. Show improvement over time, identify topics where student is getting better/worse, and recommend focus areas.

Requirement: Persistent storage (see above). This is where session sealing pays off—completed sessions are clean data points.

A/B Testing Framework

Test different grading strategies (strict vs. lenient), feedback styles (detailed vs. concise), and question generation prompts. Measure impact on student outcomes.

Infrastructure: Feature flags, session metadata (which variant), and analytics pipeline to compare variants.

Uncertainty Handling

LLMs are probabilistic. When grading is borderline (e.g., 4.5/10), explicitly communicate uncertainty: "This answer is between 4-5 marks. Here's what would push it to 5..."

Value: Builds trust through transparency. Students understand that AI grading has limits, and they know when to question the score.


Conclusion

This system works because it acknowledges LLM limitations and builds around them. We don't pretend LLMs are deterministic—we create deterministic boundaries. We don't ignore cost—we optimize it through batching. We don't hide risks—we document failure modes and safeguards.

The result is a system that founders trust (clear constraints, explicit risks) and engineers respect (sound decisions, proper boundaries). It's not perfect, but it's production-ready.

Key Features & Capabilities

  • Free-form response generation
  • Automated grading and correction
  • Verification and feedback reliability
  • Learning under ambiguity