Evaluating LLMs: metrics that matter
Beyond generic benchmarks: how to define success for your use case, measure regression, and involve the business.
Swapping models or prompts without evaluation is a blind deploy. For production you need metrics aligned with what users consider success—not just perplexity or open leaderboard scores.
The problem with generic benchmarks
MMLU, HellaSwag, and HumanEval help compare general capability. They don’t tell you if your assistant answers well on internal policies, keeps brand tone, or extracts structured data from your documents. A 95% MMLU model can score 60% on your domain.
Why public benchmarks aren’t enough
- They don’t cover your domain (legal, medical, finance, etc.)
- They don’t test the output shape you need (JSON, tables, citations)
- They don’t measure latency, cost, or refusal rate in real traffic
- They may suffer benchmark contamination in training data
- They miss your compliance and security requirements
Defining success with the business
Turn requirements into testable criteria: “must cite the correct source,” “must not invent clauses,” “must answer within N tokens,” “formal tone.” That becomes a human rubric or a partially automated checklist.
Building evaluation rubrics
A rubric is a set of specific measurable criteria. Example for HR: (1) Factuality—only company docs (40%), (2) Completeness—answers all parts (30%), (3) Tone—professional and respectful (15%), (4) Citations—references sources (15%). Score 0–3 per criterion and aggregate.
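The weighted aggregation above can be sketched in a few lines. Criterion names and weights mirror the HR example; the 0–3 input scores are illustrative.

```python
# Weighted rubric aggregation for the HR example above.
# Criteria and weights come from the text; scores are graded 0-3 per criterion.

RUBRIC = {
    "factuality": 0.40,
    "completeness": 0.30,
    "tone": 0.15,
    "citations": 0.15,
}

def rubric_score(scores: dict[str, int], max_score: int = 3) -> float:
    """Aggregate per-criterion 0..max_score grades into a 0-1 weighted score."""
    assert set(scores) == set(RUBRIC), "score every criterion exactly once"
    return sum(RUBRIC[c] * (scores[c] / max_score) for c in RUBRIC)
```

An answer that is strong on facts and tone but thin on citations, e.g. `{"factuality": 3, "completeness": 2, "tone": 3, "citations": 1}`, aggregates to 0.8.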
Involving stakeholders
Don’t define metrics alone. Work with end users, domain experts, legal/compliance, and product so you optimize what actually matters.
Layers of evaluation
Evaluation isn’t one number—it’s a pyramid of layers testing different aspects.
1. Retrieval (for RAG)
- Precision@k: how many of the top-k chunks are relevant?
- Recall@k: how many relevant chunks were retrieved?
- MRR (mean reciprocal rank): average of 1/rank of the first relevant result
- NDCG: graded relevance with position discount
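These four metrics are short enough to implement directly; a minimal sketch (illustrative, not a library):

```python
# Minimal implementations of the retrieval metrics listed above.
import math

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks found in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant result; average over queries for MRR."""
    for i, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1 / i
    return 0.0

def ndcg_at_k(retrieved: list[str], gains: dict[str, float], k: int) -> float:
    """Graded relevance with a log2 position discount, normalized by the ideal ranking."""
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 1)
              for i, d in enumerate(retrieved[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```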
2. Generation (answer quality)
- Factual accuracy vs gold answers
- LLM-as-judge on defined criteria (a complement to human review, not a replacement)
- Format compliance—valid JSON when required
- Groundedness—answer stays within provided context
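Two of these checks are cheap to automate. A sketch: strict JSON validation for format compliance, plus a crude lexical proxy for groundedness (real pipelines use NLI models or an LLM judge for the latter; this only catches blatant drift). The key names and word-length cutoff are illustrative.

```python
import json
import string

def is_valid_json(answer: str, required_keys: set[str] = frozenset()) -> bool:
    """Format compliance: parses as a JSON object containing the required keys."""
    try:
        obj = json.loads(answer)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys <= obj.keys()

def _content_words(text: str) -> set[str]:
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return {w for w in words if len(w) > 4}  # crude content-word filter

def unsupported_sentences(answer: str, context: str) -> list[str]:
    """Crude groundedness proxy: flag sentences where fewer than half of the
    content words appear anywhere in the provided context."""
    ctx = _content_words(context)
    flagged = []
    for sent in answer.split(". "):
        words = _content_words(sent)
        if words and len(words & ctx) / len(words) < 0.5:
            flagged.append(sent)
    return flagged
```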
3. Regression (changes didn’t break what worked)
A fixed regression suite runs in CI after prompt, model, RAG index, or post-processing changes. If accuracy drops beyond a threshold (e.g. 5%), block the deploy.
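The gate itself is tiny. A sketch, where `predict` stands in for your full pipeline (prompt + model + post-processing) and the threshold is illustrative:

```python
# CI regression gate: score the fixed suite, compare to the recorded baseline,
# and block the deploy if accuracy dropped past the threshold.

def accuracy(predict, cases: list[tuple[str, str]]) -> float:
    """Exact-match accuracy of `predict` over (input, expected) pairs."""
    return sum(predict(q) == gold for q, gold in cases) / len(cases)

def regression_gate(baseline_acc: float, new_acc: float, max_drop: float = 0.05) -> bool:
    """True if the change may ship: accuracy dropped by at most `max_drop`."""
    return baseline_acc - new_acc <= max_drop
```

In CI this runs after every prompt, model, or index change; `regression_gate(0.90, 0.83)` returning False is what blocks the build.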
Building evaluation datasets
A good eval set is small but representative—100 well-chosen examples often beat 10k random ones.
How to build your dataset
- Common cases (~70%): questions you expect often
- Edge cases (~15%): ambiguity, multiple interpretations
- Adversarial (~10%): attempts to break the system
- “I don’t know” cases (~5%): correct answer is to admit lack of info
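The mix above can be enforced mechanically when sampling from labeled candidate pools. A sketch; the pool structure is hypothetical, and the seed keeps the suite reproducible across runs:

```python
import random

# Target mix from the list above.
MIX = {"common": 0.70, "edge": 0.15, "adversarial": 0.10, "idk": 0.05}

def build_eval_set(pools: dict[str, list], n: int, seed: int = 0) -> list:
    """Stratified sample: draw round(n * fraction) items from each category pool."""
    rng = random.Random(seed)
    out = []
    for category, frac in MIX.items():
        out.extend(rng.sample(pools[category], round(n * frac)))
    return out
```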
Data sources
Use real user logs (when allowed), expert-written questions, LLM-generated queries validated by humans, and support-reported failures.
LLM-as-judge: when it works and when it doesn’t
Using an LLM to grade another LLM scales cheaply but has pitfalls.
When LLM-as-judge works
- Objective checks: “contains a citation?”, “valid JSON?”
- Pairwise comparisons
- Clear rubric with exemplar scores
- High volume where human review isn’t feasible for everything
When LLM-as-judge fails
Judges can prefer longer answers, favor their own family of models, miss nuance, and be gamed. Always validate on a human sample.
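One cheap mitigation for position bias in pairwise judging: run each comparison twice with the answers swapped and only count a win when both orderings agree. A sketch; `judge` is any callable returning "A" or "B" (in practice it wraps an LLM call carrying your rubric):

```python
# Debiased pairwise comparison: judge twice with swapped order, require agreement.

def debiased_pairwise(judge, question: str, ans_a: str, ans_b: str) -> str:
    first = judge(question, ans_a, ans_b)    # ans_a shown in position A
    second = judge(question, ans_b, ans_a)   # swapped order
    # Map the second verdict back: "A" in the swapped run means ans_b won.
    second_mapped = "B" if second == "A" else "A"
    if first == second_mapped:
        return first
    return "tie"  # orderings disagree -> the judge is reacting to position, not quality
```

A purely position-biased judge (one that always answers "A") collapses to "tie" under this scheme instead of silently polluting the win rate.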
Production signals: what really matters
Offline metrics matter but don’t replace online behavior—real users surprise you.
Behavioral metrics
- Thumbs up/down
- Abandonment after an answer
- Follow-up reformulations (rephrasing signals a weak answer)
- Time to next action
- Copy rate as a usefulness proxy
Technical metrics
- Latency p50/p95/p99
- Error rate
- Refusal / “I don’t know” rate
- Cost per query
- Citation rate for RAG
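For the latency percentiles, the nearest-rank method is enough for dashboards. A minimal sketch; the sample latencies are made up:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile, p in [0, 100]."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, math.ceil(p / 100 * len(s)) - 1))
    return s[k]

latencies_ms = [120, 135, 150, 180, 240, 310, 900]
for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
```

Note how one slow outlier (900 ms) dominates p95/p99 while leaving p50 untouched; that is exactly why all three are tracked.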
A/B tests and controlled experiments
With enough traffic, A/B is the best way to validate changes: control vs treatment, compare behavioral metrics.
Good practices for LLM A/B tests
- Sample size: often hundreds or thousands of queries
- Run for at least ~1 week to capture day-of-week variation
- Segment new vs returning users
- Adjust for multiple comparisons if testing many variants
- Avoid vanity metrics—more tokens ≠ better answers
Tools for LLM evaluation
- PromptFoo: open-source prompt testing and comparison
- LangSmith: datasets, run comparison, integrated judges
- Ragas: RAG metrics (faithfulness, answer relevancy, context recall)
- DeepEval: pytest-like assertions for LLM outputs
- Phoenix (Arize): embeddings, drift, traces
Conclusion
Evaluation is a system: offline datasets, LLM judges (with human checks), online metrics, and direct user feedback. Define success with stakeholders. Keep a regression suite so you don’t break what already works.
The goal isn’t maximizing a benchmark—it’s building something real users find useful and trustworthy that the business can afford.