Grading dental crown prep with a multi-model AI agent
A custom AI agent, built with the University of Colorado, that evaluates dental student crown preparations against configurable rubrics. It orchestrates models across OpenRouter, OpenAI, Google, and Anthropic and integrates with a Base44 frontend.
- Geography: Colorado, United States
- Year: 2025
- Stage: Faculty-led EdTech project, scaling to multiple dental schools
- Team: 1 senior full-stack engineer
- Duration: Fixed-scope build, single phase, ongoing collaboration
The situation
Evan Menke, faculty at the University of Colorado, was using AI in a personal capacity to evaluate dental crown preparations, the physical reproductions dental students make of crown shapes during their training. The pattern that emerged was simple. A student finishes a crown, photographs it, runs it past the AI, and gets a numeric score against a rubric and an explanation of where it fell short. Faculty time saved. Students self-correct between formal reviews.
The personal tool worked. The natural question became: could the same pattern serve students across multiple dental schools, not just one faculty member's section? Scaling required moving from a single-user experiment to a backend that could handle multiple rubrics, switch between LLM providers as they evolve, store scoring history per student, and integrate with a frontend that faculty and students could actually use.
Evan brought Leanware in to build that backend. The frontend lived on Base44 already; the work was the evaluation engine, the rubric system, and the multi-model orchestration that would let the platform stay accurate as the underlying LLM ecosystem moved.
What we built
The engagement was scoped as a Managed Custom AI Agent: Leanware builds, deploys, and operates the system; the client uses it. Fixed scope, one senior full-stack engineer, with the option to extend into managed-service operation once the platform is in steady state.
The backend runs on FastAPI with LangGraph orchestrating LLM workflows and PostgreSQL holding rubrics, evaluation history, and scoring metadata. The provider layer abstracts across OpenRouter, OpenAI direct, Google, and Anthropic so the platform can swap models per rubric or per evaluation and run cross-provider comparisons without code changes.
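As an illustration of what "swap models per rubric without code changes" can look like, here is a minimal sketch of a provider registry. It is not the production code: `ProviderRegistry`, `ModelRoute`, the call signature, the rubric ID, and the model identifiers are all hypothetical, and the providers are stubs rather than real SDK calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical provider call signature: (model, prompt, image_bytes) -> raw text.
ProviderFn = Callable[[str, str, bytes], str]


@dataclass(frozen=True)
class ModelRoute:
    provider: str  # e.g. "openrouter", "openai", "google", "anthropic"
    model: str     # placeholder identifier, not a real model name


class ProviderRegistry:
    """Maps a rubric ID to the provider/model pair chosen in configuration."""

    def __init__(self, default: ModelRoute) -> None:
        self._providers: Dict[str, ProviderFn] = {}
        self._routes: Dict[str, ModelRoute] = {}
        self._default = default

    def register(self, name: str, fn: ProviderFn) -> None:
        self._providers[name] = fn

    def set_route(self, rubric_id: str, route: ModelRoute) -> None:
        # Reject routes that point at a provider nobody registered.
        if route.provider not in self._providers:
            raise KeyError(f"unknown provider: {route.provider}")
        self._routes[rubric_id] = route

    def route_for(self, rubric_id: str) -> ModelRoute:
        # Rubrics without an explicit route fall back to the default.
        return self._routes.get(rubric_id, self._default)

    def evaluate(self, rubric_id: str, prompt: str, image: bytes) -> str:
        route = self.route_for(rubric_id)
        return self._providers[route.provider](route.model, prompt, image)


def make_stub(name: str) -> ProviderFn:
    """Stand-in for a real provider client; echoes which route was used."""
    def call(model: str, prompt: str, image: bytes) -> str:
        return f"{name}:{model}"
    return call


registry = ProviderRegistry(default=ModelRoute("openrouter", "model-a"))
for provider_name in ("openrouter", "anthropic"):
    registry.register(provider_name, make_stub(provider_name))

# Changing this one route (in practice, a database row) switches the model
# used for a rubric; the evaluation code path stays the same.
registry.set_route("crown-prep-v2", ModelRoute("anthropic", "model-b"))
```

In a setup like this, a cross-provider comparison is the same `evaluate` call repeated with different routes for the same image.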
The configuration surface is the part of the platform that buyers tend to underestimate. The system lets a faculty member configure the base prompt the LLM sees, the rubric structure (which criteria count, in what proportion), the score weights applied per criterion, and the thresholds at which the system surfaces a critical finding versus a minor one. As the rubric evolves (and dental rubrics evolve as faculty refine their teaching), the same backend handles the new shape without a rebuild.
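To make the weights and thresholds concrete, the sketch below shows one plausible way a configured rubric could turn per-criterion scores into an overall score and a finding severity. The criterion names, the 0 to 100 scale, the field names, and the threshold values are assumptions chosen for illustration, not the platform's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float          # relative weight; weights need not sum to 100
    critical_below: float  # a score under this is a critical finding
    minor_below: float     # a score under this (but not critical) is minor


def weighted_score(criteria: List[Criterion], scores: Dict[str, float]) -> float:
    """Weighted average of per-criterion scores on the same 0-100 scale."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    return sum(c.weight * scores[c.name] for c in criteria) / total_weight


def classify(criterion: Criterion, score: float) -> str:
    """Map one criterion score to 'critical', 'minor', or 'ok'."""
    if score < criterion.critical_below:
        return "critical"
    if score < criterion.minor_below:
        return "minor"
    return "ok"


# Hypothetical rubric: three criteria weighted 50 / 30 / 20.
rubric = [
    Criterion("margin", weight=50, critical_below=65, minor_below=85),
    Criterion("taper", weight=30, critical_below=50, minor_below=85),
    Criterion("reduction", weight=20, critical_below=50, minor_below=85),
]
scores = {"margin": 60, "taper": 80, "reduction": 90}

overall = weighted_score(rubric, scores)  # (50*60 + 30*80 + 20*90) / 100
findings = {c.name: classify(c, scores[c.name]) for c in rubric}
```

Because the rubric is data rather than code, adding a criterion or changing a weight is a configuration change, which is the property the paragraph above describes.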
A second non-obvious capability: for each evaluated image, the system returns percentage-based coordinate highlights for the regions the LLM scored against. The frontend uses these to overlay the model reasoning on the student photo, which converts an abstract score into a specific "here is where the margin is wrong" annotation.
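A frontend consuming percentage-based highlights has to map them onto the rendered image size. The sketch below assumes a box schema of `x`, `y`, `width`, `height` expressed as percentages of the image dimensions, measured from the top-left corner; the actual response format is not documented here, so treat the field names as hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PercentBox:
    x: float       # left edge, percent of image width
    y: float       # top edge, percent of image height
    width: float   # box width, percent of image width
    height: float  # box height, percent of image height
    label: str     # e.g. the rubric criterion the region relates to


def _clamp(value: float) -> float:
    """Keep a percentage inside the 0-100 range."""
    return max(0.0, min(100.0, value))


def to_pixels(box: PercentBox, image_width: int, image_height: int) -> Tuple[int, int, int, int]:
    """Convert a percentage box to (left, top, width, height) in pixels.

    Edges are clamped to the image so a slightly out-of-range model output
    still produces a drawable rectangle.
    """
    left = round(_clamp(box.x) / 100 * image_width)
    top = round(_clamp(box.y) / 100 * image_height)
    right = round(_clamp(box.x + box.width) / 100 * image_width)
    bottom = round(_clamp(box.y + box.height) / 100 * image_height)
    return left, top, right - left, bottom - top


highlight = PercentBox(x=25, y=10, width=50, height=20, label="margin")
pixel_box = to_pixels(highlight, image_width=800, image_height=600)
```

Percentages keep the annotation independent of resolution: the same highlight lines up whether the photo is shown as a thumbnail or full size.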
Deployment runs through GitHub Actions, which ships Docker images to Render.com. Before launch, Leanware ran initial accuracy testing across providers and delivered a comparative report covering which models scored highest on rubric adherence, which produced the most stable outputs across image variations, and which combinations were cost-effective at scale. The documentation and code shipped at a standard that supports both ongoing maintenance and a future expansion into training a custom model specialized for dental rubrics.
Outcome
- Multi-provider AI evaluation backend in production across OpenRouter, OpenAI, Google, and Anthropic
- Initial model comparison report delivered, covering accuracy and cost per provider
- The AI agent keeps improving as it sees more cases (from client quote)
The backend launched against the Base44 frontend and is in active use evaluating crown preparations. Multi-provider switching works in production. The rubric and prompt configuration surface gives Evan the ability to refine the grading system as more student work flows through it. The initial model comparison report is published as a reference for which provider performs best on which rubric type.
The collaboration continues at a lighter cadence. As Evan accumulates evaluation samples, the path opens to training a custom model specialized for dental crown rubrics, which would lift accuracy beyond what the general-purpose LLMs produce today.
"I presented this to my clinic and the response was strong. Leanware did excellent work, and we're going to use feedback from real evaluations to keep refining the AI agent, improving grading accuracy organically as it sees more cases."
Evan Menke, University of Colorado · Colorado, United States
Engagement FAQ
How do I benchmark AI model performance for image-based grading tasks?
Use a labeled dataset that mirrors real student submissions, test across varied lighting and resolution conditions, track rubric adherence, break down error categories, and compare cost per evaluation across model providers.
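A benchmark along these lines can start very small. The sketch below compares providers on mean absolute error against faculty-assigned scores and on average cost per evaluation. The record fields, provider names, and every number in the sample data are made up for illustration; they are not results from this project.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class EvalRecord:
    provider: str
    faculty_score: float  # ground-truth label from a human grader
    model_score: float    # score the model produced for the same image
    cost_usd: float       # cost of this single evaluation


def summarize(records: Iterable[EvalRecord]) -> Dict[str, Dict[str, float]]:
    """Per-provider mean absolute error, mean cost, and sample count."""
    by_provider: Dict[str, List[EvalRecord]] = defaultdict(list)
    for record in records:
        by_provider[record.provider].append(record)
    return {
        provider: {
            "mean_abs_error": mean(abs(r.model_score - r.faculty_score) for r in rows),
            "cost_per_eval_usd": mean(r.cost_usd for r in rows),
            "n": float(len(rows)),
        }
        for provider, rows in by_provider.items()
    }


# Illustrative sample only: two images graded by two hypothetical providers.
sample = [
    EvalRecord("provider_a", faculty_score=80, model_score=76, cost_usd=0.5),
    EvalRecord("provider_a", faculty_score=60, model_score=66, cost_usd=0.5),
    EvalRecord("provider_b", faculty_score=80, model_score=79, cost_usd=1.0),
    EvalRecord("provider_b", faculty_score=60, model_score=63, cost_usd=1.0),
]
report = summarize(sample)
```

Adding fields such as lighting condition or resolution to each record would let the same grouping break errors down by image quality.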
What QA processes are required to test AI evaluation consistency?
Benchmark datasets, cross-provider comparisons, image-quality variance testing, regression testing after model updates, and documented accuracy thresholds. Evaluation systems that grade student work need the same QA rigor as any other safety-relevant assessment tool.
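Two of these checks are easy to automate. The sketch below flags images whose scores vary too much across repeated runs and gates a model update on a documented error tolerance. The function names, the use of population standard deviation as the spread measure, and the threshold values are assumptions for illustration.

```python
from statistics import pstdev
from typing import Dict, List


def unstable_images(runs: Dict[str, List[float]], max_stdev: float) -> List[str]:
    """Return image IDs whose repeated-run scores spread beyond max_stdev."""
    return sorted(
        image_id
        for image_id, scores in runs.items()
        if len(scores) > 1 and pstdev(scores) > max_stdev
    )


def regression_ok(baseline_mae: float, candidate_mae: float, tolerance: float) -> bool:
    """Accept a model update only if its error stays within the tolerance."""
    return candidate_mae <= baseline_mae + tolerance


# Illustrative data: the same image scored three times by one model.
repeated_runs = {
    "img-001": [72, 73, 71],  # tight spread, considered stable
    "img-002": [60, 80, 70],  # wide spread, should be flagged
}
flagged = unstable_images(repeated_runs, max_stdev=3.0)
```

Running both checks in CI after every prompt or model change turns "documented accuracy thresholds" into something that can fail a build.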
How do I assess if a dev shop has real AI/ML expertise versus only API integration experience?
Ask the team to walk through their prompt-testing methodology, accuracy measurement process, error classification framework, and approach to switching between or comparing models. Real AI engineers can explain these clearly and have artifacts (model comparison logs, accuracy audit reports) to back them up.
How do I evaluate if a dev shop has real experience with AI image analysis for education?
Ask for architecture diagrams, model comparison data, annotated output samples, and reproducibility testing workflows. Teams relying only on plug-and-play APIs will not have this depth of evidence.