Evaluating LLMs: metrics that matter
Beyond generic benchmarks: how to define success for your use case, measure regression, and involve the business.
Swapping models or prompts without evaluation is a blind deploy. For production you need metrics aligned with what users consider success—not just perplexity or open leaderboard scores.
The problem with generic benchmarks
MMLU, HellaSwag, and HumanEval help compare general capability. They don’t tell you if your assistant answers well on internal policies, keeps brand tone, or extracts structured data from your documents. A 95% MMLU model can score 60% on your domain.
Why public benchmarks aren’t enough
- They don’t cover your domain (legal, medical, finance, etc.)
- They don’t test the output shape you need (JSON, tables, citations)
- They don’t measure latency, cost, or refusal rate in real traffic
- They may suffer benchmark contamination in training data
- They miss your compliance and security requirements
Defining success with the business
Turn requirements into testable criteria: “must cite the correct source,” “must not invent clauses,” “must answer within N tokens,” “formal tone.” That becomes a human rubric or a partially automated checklist.
Building evaluation rubrics
A rubric is a set of specific measurable criteria. Example for HR: (1) Factuality—only company docs (40%), (2) Completeness—answers all parts (30%), (3) Tone—professional and respectful (15%), (4) Citations—references sources (15%). Score 0–3 per criterion and aggregate.
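The weighted aggregation above can be sketched in a few lines. Criterion names and weights mirror the HR example; the 0–3 input scores are illustrative.

```python
# Weighted rubric aggregation for the HR example above.
# Criteria and weights come from the text; scores are graded 0-3 per criterion.

RUBRIC = {
    "factuality": 0.40,
    "completeness": 0.30,
    "tone": 0.15,
    "citations": 0.15,
}

def rubric_score(scores: dict[str, int], max_score: int = 3) -> float:
    """Aggregate per-criterion 0..max_score grades into a 0-1 weighted score."""
    assert set(scores) == set(RUBRIC), "score every criterion exactly once"
    return sum(RUBRIC[c] * (scores[c] / max_score) for c in RUBRIC)
```

An answer that is strong on facts and tone but thin on citations, e.g. `{"factuality": 3, "completeness": 2, "tone": 3, "citations": 1}`, aggregates to 0.8.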
Involving stakeholders
Don’t define metrics alone. Work with end users, domain experts, legal/compliance, and product so you optimize what actually matters.
Layers of evaluation
Evaluation isn’t one number—it’s a pyramid of layers testing different aspects.
1. Retrieval (for RAG)
- Precision@k: how many of the top-k chunks are relevant?
- Recall@k: how many relevant chunks were retrieved?
- MRR (mean reciprocal rank): average of 1/rank of the first relevant result
- NDCG: graded relevance with position discount
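These four metrics are short enough to implement directly; a minimal sketch (illustrative, not a library):

```python
# Minimal implementations of the retrieval metrics listed above.
import math

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks found in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant result; average over queries for MRR."""
    for i, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1 / i
    return 0.0

def ndcg_at_k(retrieved: list[str], gains: dict[str, float], k: int) -> float:
    """Graded relevance with a log2 position discount, normalized by the ideal ranking."""
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 1)
              for i, d in enumerate(retrieved[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```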
2. Generation (answer quality)
- Factual accuracy vs gold answers
- LLM-as-judge on defined criteria (a complement to human review, not a replacement)
- Format compliance—valid JSON when required
- Groundedness—answer stays within provided context
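Two of these checks are cheap to automate. A sketch: strict JSON validation for format compliance, plus a crude lexical proxy for groundedness (real pipelines use NLI models or an LLM judge for the latter; this only catches blatant drift). The key names and word-length cutoff are illustrative.

```python
import json
import string

def is_valid_json(answer: str, required_keys: set[str] = frozenset()) -> bool:
    """Format compliance: parses as a JSON object containing the required keys."""
    try:
        obj = json.loads(answer)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys <= obj.keys()

def _content_words(text: str) -> set[str]:
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return {w for w in words if len(w) > 4}  # crude content-word filter

def unsupported_sentences(answer: str, context: str) -> list[str]:
    """Crude groundedness proxy: flag sentences where fewer than half of the
    content words appear anywhere in the provided context."""
    ctx = _content_words(context)
    flagged = []
    for sent in answer.split(". "):
        words = _content_words(sent)
        if words and len(words & ctx) / len(words) < 0.5:
            flagged.append(sent)
    return flagged
```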
3. Regression (changes didn’t break what worked)
A fixed regression suite runs in CI after prompt, model, RAG index, or post-processing changes. If accuracy drops beyond a threshold (e.g. 5%), block the deploy.
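The gate itself is tiny. A sketch, where `predict` stands in for your full pipeline (prompt + model + post-processing) and the threshold is illustrative:

```python
# CI regression gate: score the fixed suite, compare to the recorded baseline,
# and block the deploy if accuracy dropped past the threshold.

def accuracy(predict, cases: list[tuple[str, str]]) -> float:
    """Exact-match accuracy of `predict` over (input, expected) pairs."""
    return sum(predict(q) == gold for q, gold in cases) / len(cases)

def regression_gate(baseline_acc: float, new_acc: float, max_drop: float = 0.05) -> bool:
    """True if the change may ship: accuracy dropped by at most `max_drop`."""
    return baseline_acc - new_acc <= max_drop
```

In CI this runs after every prompt, model, or index change; `regression_gate(0.90, 0.83)` returning False is what blocks the build.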
Building evaluation datasets
A good eval set is small but representative—100 well-chosen examples often beat 10k random ones.
How to build your dataset
- Common cases (~70%): questions you expect often
- Edge cases (~15%): ambiguity, multiple interpretations
- Adversarial (~10%): attempts to break the system
- “I don’t know” cases (~5%): correct answer is to admit lack of info
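The mix above can be enforced mechanically when sampling from labeled candidate pools. A sketch; the pool structure is hypothetical, and the seed keeps the suite reproducible across runs:

```python
import random

# Target mix from the list above.
MIX = {"common": 0.70, "edge": 0.15, "adversarial": 0.10, "idk": 0.05}

def build_eval_set(pools: dict[str, list], n: int, seed: int = 0) -> list:
    """Stratified sample: draw round(n * fraction) items from each category pool."""
    rng = random.Random(seed)
    out = []
    for category, frac in MIX.items():
        out.extend(rng.sample(pools[category], round(n * frac)))
    return out
```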
Data sources
Use real user logs (when allowed), expert-written questions, LLM-generated queries validated by humans, and support-reported failures.
LLM-as-judge: when it works and when it doesn’t
Using an LLM to grade another LLM scales cheaply but has pitfalls.
When LLM-as-judge works
- Objective checks: “contains a citation?”, “valid JSON?”
- Pairwise comparisons
- Clear rubric with exemplar scores
- High volume where human review isn’t feasible for everything
When LLM-as-judge fails
Judges can prefer longer answers, favor their own family of models, miss nuance, and be gamed. Always validate on a human sample.
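One cheap mitigation for position bias in pairwise judging: run each comparison twice with the answers swapped and only count a win when both orderings agree. A sketch; `judge` is any callable returning "A" or "B" (in practice it wraps an LLM call carrying your rubric):

```python
# Debiased pairwise comparison: judge twice with swapped order, require agreement.

def debiased_pairwise(judge, question: str, ans_a: str, ans_b: str) -> str:
    first = judge(question, ans_a, ans_b)    # ans_a shown in position A
    second = judge(question, ans_b, ans_a)   # swapped order
    # Map the second verdict back: "A" in the swapped run means ans_b won.
    second_mapped = "B" if second == "A" else "A"
    if first == second_mapped:
        return first
    return "tie"  # orderings disagree -> the judge is reacting to position, not quality
```

A purely position-biased judge (one that always answers "A") collapses to "tie" under this scheme instead of silently polluting the win rate.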
Production signals: what really matters
Offline metrics matter but don’t replace online behavior—real users surprise you.
Behavioral metrics
- Thumbs up/down
- Abandonment after an answer
- Follow-up reformulations (rephrasing signals a weak answer)
- Time to next action
- Copy rate as a usefulness proxy
Technical metrics
- Latency p50/p95/p99
- Error rate
- Refusal / “I don’t know” rate
- Cost per query
- Citation rate for RAG
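For the latency percentiles, the nearest-rank method is enough for dashboards. A minimal sketch; the sample latencies are made up:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile, p in [0, 100]."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, math.ceil(p / 100 * len(s)) - 1))
    return s[k]

latencies_ms = [120, 135, 150, 180, 240, 310, 900]
for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
```

Note how one slow outlier (900 ms) dominates p95/p99 while leaving p50 untouched; that is exactly why all three are tracked.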
A/B tests and controlled experiments
With enough traffic, A/B is the best way to validate changes: control vs treatment, compare behavioral metrics.
Good practices for LLM A/B tests
- Sample size: often hundreds or thousands of queries
- Run for at least ~1 week to capture day-of-week variation
- Segment new vs returning users
- Adjust for multiple comparisons if testing many variants
- Avoid vanity metrics—more tokens ≠ better answers
Tools for LLM evaluation
- PromptFoo: open-source prompt testing and comparison
- LangSmith: datasets, run comparison, integrated judges
- Ragas: RAG metrics (faithfulness, answer relevancy, context recall)
- DeepEval: pytest-like assertions for LLM outputs
- Phoenix (Arize): embeddings, drift, traces
Conclusion
Evaluation is a system: offline datasets, LLM judges (with human checks), online metrics, and direct user feedback. Define success with stakeholders. Keep a regression suite so you don’t break what already works.
The goal isn’t maximizing a benchmark—it’s building something real users find useful and trustworthy that the business can afford.