Education / EdTech · Managed Custom AI Agents · University of Colorado · May 25, 2026

Custom AI Grading Software for Dental Schools

Custom AI grading software built with the University of Colorado. A multi-model AI agent grades dental crown preps against configurable rubrics, in production and scaling to more dental schools.

Geography: Colorado, United States
Year: 2026
Stage: Student-led EdTech project, scaling to multiple dental schools
Team: 1 senior full-stack engineer
Duration: Fixed-scope build, single phase, ongoing collaboration

The situation

Evan Menke, a student at the University of Colorado, had already proven the hard part. Using AI in a personal capacity, he was evaluating dental crown preparations, the physical reproductions dental students carve during training. A student finishes a crown, photographs it, runs it past the AI, and gets a numeric score against a rubric plus an explanation of where it fell short. That saves faculty time and lets students self-correct between formal reviews. What he did not have was AI grading software: a real platform that could serve students across multiple dental schools rather than one person's workflow.

Getting there meant moving from a single-user experiment to a backend that could handle multiple rubrics, switch between LLM providers as they evolve, store scoring history per student, and integrate with a frontend that faculty and students could actually use. No off-the-shelf grading tool covers that combination, especially for photographed physical work rather than essays or quizzes.

Evan brought Leanware in to build the evaluation engine, the rubric system, and the multi-model orchestration that would keep the platform accurate as the underlying LLM ecosystem moved. The frontend already lived on Base44; the work was everything behind it.

What we built

The engagement was scoped as a managed custom AI agent: Leanware builds, deploys, and operates the system; the client uses it. Fixed scope, one senior full-stack engineer, with the option to extend into managed-service operation once the platform reaches steady state.

An evaluation engine built to outlive any single model

The backend runs on FastAPI with LangGraph orchestrating the LLM workflows and PostgreSQL holding rubrics, evaluation history, and scoring metadata. The provider layer abstracts across OpenRouter, OpenAI direct, Google, and Anthropic, so the platform can swap models per rubric or per evaluation and run cross-provider comparisons without code changes. When a better model ships, the grading software adopts it instead of being stranded on last year's choice.
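A minimal sketch of how a provider layer like this can be structured, not the code Leanware shipped. Every name here (Grader, StubGrader, PROVIDERS, ModelChoice, resolve_grader) and the placeholder model names are assumptions for illustration, and the stub makes no network calls. The idea it shows: the model choice lives in data, so a rubric default or a per-evaluation override picks the provider without touching the calling code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Protocol


class Grader(Protocol):
    """Anything that can turn an image plus a prompt into raw model text."""

    def grade(self, image: bytes, prompt: str) -> str: ...


@dataclass(frozen=True)
class StubGrader:
    """Stand-in for a real provider client; makes no network calls."""

    provider: str
    model: str

    def grade(self, image: bytes, prompt: str) -> str:
        return f"[{self.provider}/{self.model}] graded {len(image)} bytes"


# Hypothetical registry: provider name -> factory that takes a model name.
# A real system would register one client wrapper per provider here.
PROVIDERS: Dict[str, Callable[[str], Grader]] = {
    name: (lambda model, name=name: StubGrader(name, model))
    for name in ("openrouter", "openai", "google", "anthropic")
}


@dataclass(frozen=True)
class ModelChoice:
    """Provider and model stored as configuration, not hard-coded."""

    provider: str
    model: str


def resolve_grader(
    rubric_default: ModelChoice, override: Optional[ModelChoice] = None
) -> Grader:
    """Use the per-evaluation override if given, else the rubric default."""
    choice = override or rubric_default
    try:
        factory = PROVIDERS[choice.provider]
    except KeyError:
        raise ValueError(f"unknown provider: {choice.provider}") from None
    return factory(choice.model)


if __name__ == "__main__":
    default = ModelChoice("openai", "model-a")        # placeholder model name
    override = ModelChoice("anthropic", "model-b")    # placeholder model name
    print(resolve_grader(default).grade(b"abc", "Score this prep."))
    print(resolve_grader(default, override).grade(b"abc", "Score this prep."))
```

Running the same image through several ModelChoice values is also how a cross-provider comparison would fall out of this shape: the loop changes the configuration, not the grading code.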

Rubrics the client controls

The configuration surface is the part of AI grading software that buyers tend to underestimate. Whoever owns the rubric can configure the base prompt the LLM sees, the rubric structure (which criteria count, and in what proportion), the score weights per criterion, and the thresholds at which the system flags a critical finding versus a minor one. Dental rubrics evolve as faculty refine their teaching, and the same backend handles each new shape without a rebuild.
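To make the configuration surface concrete, here is a small sketch of a rubric with per-criterion weights and flagging thresholds. The field names, the criterion names (margin, taper, occlusal_reduction), and the 0-100 scale are assumptions chosen for illustration; the platform's actual schema is not described in this case study.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float          # relative weight; normalized when scoring
    critical_below: float  # criterion score (0-100) under which a finding is critical
    minor_below: float     # criterion score (0-100) under which a finding is minor


@dataclass(frozen=True)
class Rubric:
    base_prompt: str       # the prompt text the LLM sees for this rubric
    criteria: List[Criterion]


def weighted_score(rubric: Rubric, scores: Dict[str, float]) -> float:
    """Combine per-criterion scores into one score using normalized weights."""
    total_weight = sum(c.weight for c in rubric.criteria)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    return sum(c.weight * scores[c.name] for c in rubric.criteria) / total_weight


def findings(rubric: Rubric, scores: Dict[str, float]) -> Dict[str, str]:
    """Label each criterion that falls under its thresholds."""
    flagged: Dict[str, str] = {}
    for c in rubric.criteria:
        score = scores[c.name]
        if score < c.critical_below:
            flagged[c.name] = "critical"
        elif score < c.minor_below:
            flagged[c.name] = "minor"
    return flagged


if __name__ == "__main__":
    # Hypothetical rubric: margin counts twice as much as the other criteria.
    rubric = Rubric(
        base_prompt="Evaluate this crown preparation against the criteria.",
        criteria=[
            Criterion("margin", weight=2, critical_below=60, minor_below=85),
            Criterion("taper", weight=1, critical_below=60, minor_below=85),
            Criterion("occlusal_reduction", weight=1, critical_below=60, minor_below=85),
        ],
    )
    scores = {"margin": 50, "taper": 80, "occlusal_reduction": 90}
    print(weighted_score(rubric, scores))  # 67.5
    print(findings(rubric, scores))        # {'margin': 'critical', 'taper': 'minor'}
```

Because the weights and thresholds are plain data, a faculty change to the rubric becomes a configuration edit rather than a code change.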

Grading that shows its work

For each evaluated image, the system returns percentage-based coordinate highlights for the regions the LLM scored against. The frontend overlays the model's reasoning directly on the student's photo, turning an abstract score into a specific "here is where the margin is wrong" annotation.
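One reason to return percentages rather than pixels is that the same highlight works at any display size. The sketch below shows that conversion under an assumed payload shape (top-left corner plus width and height, each as a percentage of the image); the Highlight fields and the to_pixels helper are hypothetical, not the platform's actual API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Highlight:
    x_pct: float       # left edge, as a percentage of image width
    y_pct: float       # top edge, as a percentage of image height
    width_pct: float   # box width, as a percentage of image width
    height_pct: float  # box height, as a percentage of image height
    note: str          # the model's explanation for this region


def _clamp(value: float) -> float:
    """Keep a percentage inside the 0-100 range."""
    return min(max(value, 0.0), 100.0)


def to_pixels(h: Highlight, image_width: int, image_height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) in pixels, clamped to the image bounds."""
    left = _clamp(h.x_pct)
    top = _clamp(h.y_pct)
    right = _clamp(h.x_pct + h.width_pct)
    bottom = _clamp(h.y_pct + h.height_pct)
    return (
        round(left * image_width / 100),
        round(top * image_height / 100),
        round(right * image_width / 100),
        round(bottom * image_height / 100),
    )


if __name__ == "__main__":
    h = Highlight(25, 50, 50, 25, note="Margin is uneven along this edge.")
    print(to_pixels(h, 800, 600))    # (200, 300, 600, 450)
    print(to_pixels(h, 1600, 1200))  # (400, 600, 1200, 900)
```

The same Highlight maps to the same region of the photo at both sizes, which is what lets a frontend draw the overlay on a thumbnail or a full-resolution view.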

Tested before it graded anyone

Deployment runs from GitHub Actions to Render.com via Docker images. Before launch, Leanware ran accuracy testing across providers and delivered a comparative report covering which models scored highest on rubric adherence, which produced the most stable outputs across image variations, and which combinations were cost-effective at scale. Documentation and code shipped at a standard that supports ongoing maintenance and a future expansion into training a custom model specialized for dental rubrics.

Outcome

Multi-provider AI evaluation backend in production across OpenRouter, OpenAI, Google, and Anthropic
Initial model comparison report delivered, covering accuracy and cost per provider
Ongoing refinement from real evaluation feedback as the AI agent sees more cases (from client quote)

The backend launched against the Base44 frontend and is in active use evaluating crown preparations. Multi-provider switching works in production, and the rubric and prompt configuration surface lets Evan refine the grading system as more student work flows through it. The model comparison report stands as a published reference for which provider performs best on which rubric type.

The collaboration continues on a smaller cadence. As evaluation samples accumulate, the path opens to training a custom model specialized for dental crown rubrics, which would lift accuracy beyond what general-purpose LLMs produce today.

"I presented this to my clinic and the response was strong. Leanware did excellent work, and we're going to use feedback from real evaluations to keep refining the AI agent, improving grading accuracy organically as it sees more cases."
- Evan Menke, University of Colorado · United States

Engagement line

Managed Custom AI Agents

Engagement FAQ

Can AI grading software evaluate physical work like dental crown preparations?

Yes. In the University of Colorado platform, a student photographs the finished crown preparation and the AI agent scores it against a configurable rubric, returning a numeric score, an explanation of where it fell short, and coordinate highlights over the exact regions it scored. The pattern applies to any assessment where the work can be photographed.

How is custom AI grading software different from AI grading tools for teachers?

Off-the-shelf AI grading tools ship with their own rubric assumptions, mostly around text. The Colorado platform needed configurable rubric structures, per-criterion score weights, client-controlled prompts, and the ability to switch LLM providers per evaluation. That level of control over how grading works is what pushes a project from a tool subscription to custom AI grading software.

How does AI grading stay accurate as LLMs change?

The platform abstracts across OpenRouter, OpenAI, Google, and Anthropic, so models can be swapped per rubric or per evaluation without code changes. Before launch, Leanware delivered a model comparison report covering rubric adherence, output stability across image variations, and cost per provider, giving the client a factual basis for which model grades which rubric.

What does AI in dental education look like in practice?

At the University of Colorado, dental students photograph their crown preparations and get an immediate rubric-based score with an explanation of what to fix, so they self-correct between formal reviews instead of waiting for faculty availability. Faculty time goes to teaching rather than repetitive first-pass evaluation.

How do you benchmark AI model performance for image-based grading?

Test across providers with real student submissions, track rubric adherence, measure output stability across lighting and resolution variations, and compare cost per evaluation at scale. That benchmarking produced the comparative report the Colorado platform launched with.
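The three measurements above can be reduced to a small per-provider summary. This is a sketch under stated assumptions, not the method behind the Colorado report: rubric adherence is approximated here as mean absolute error against a faculty reference score, stability as the spread (population standard deviation) of scores for the same submission across image variations, and the Run record and summarize function are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, List


@dataclass(frozen=True)
class Run:
    provider: str
    submission_id: str
    variation: str     # e.g. "baseline", "dim_light", "low_res"
    score: float       # model's score, 0-100
    reference: float   # faculty reference score, 0-100 (assumed to exist)
    cost_usd: float    # cost of this single evaluation


def summarize(runs: List[Run]) -> Dict[str, Dict[str, float]]:
    """Aggregate error, stability, and cost for each provider."""
    by_provider: Dict[str, List[Run]] = {}
    for run in runs:
        by_provider.setdefault(run.provider, []).append(run)

    report: Dict[str, Dict[str, float]] = {}
    for provider, provider_runs in by_provider.items():
        # Group scores by submission so spread is measured across variations
        # of the same photo, not across different students' work.
        by_submission: Dict[str, List[float]] = {}
        for run in provider_runs:
            by_submission.setdefault(run.submission_id, []).append(run.score)

        report[provider] = {
            "mean_abs_error": mean(abs(r.score - r.reference) for r in provider_runs),
            "mean_score_spread": mean(pstdev(s) for s in by_submission.values()),
            "cost_per_eval_usd": mean(r.cost_usd for r in provider_runs),
        }
    return report


if __name__ == "__main__":
    # Made-up numbers purely to exercise the function.
    runs = [
        Run("provider_a", "s1", "baseline", 70, 75, 0.01),
        Run("provider_a", "s1", "dim_light", 80, 75, 0.03),
        Run("provider_b", "s1", "baseline", 75, 75, 0.05),
        Run("provider_b", "s1", "dim_light", 75, 75, 0.05),
    ]
    for provider, stats in summarize(runs).items():
        print(provider, stats)
```

Lower is better on all three numbers, so a provider that is slightly less accurate but far cheaper and more stable shows up clearly as a trade-off rather than a simple ranking.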

Can an AI agent for education integrate with an existing frontend?

Yes. The Colorado grading platform kept its existing Base44 frontend; Leanware built the FastAPI backend behind it, including the API the frontend uses to render scores and overlay coordinate highlights on student photos.

Tags: AI agent, education, multi-model, LangGraph, FastAPI, image analysis