HAND-TAGGED >>> 991 SKILLS LIVE <<<* OPEN SOURCE *NO LOGIN, NO TRACKING FRESH DROPS WEEKLY HAND-TAGGED >>> 991 SKILLS LIVE <<<* OPEN SOURCE *NO LOGIN, NO TRACKING FRESH DROPS WEEKLY HAND-TAGGED >>> 991 SKILLS LIVE <<<* OPEN SOURCE *NO LOGIN, NO TRACKING FRESH DROPS WEEKLY HAND-TAGGED >>> 991 SKILLS LIVE <<<* OPEN SOURCE *NO LOGIN, NO TRACKING FRESH DROPS WEEKLY HAND-TAGGED >>> 991 SKILLS LIVE <<<* OPEN SOURCE *NO LOGIN, NO TRACKING FRESH DROPS WEEKLY HAND-TAGGED >>> 991 SKILLS LIVE <<<* OPEN SOURCE *NO LOGIN, NO TRACKING FRESH DROPS WEEKLY
← back to homepage
Evaluate LLMs like a proSKILL #TION
Coding

llm-evaluation

Evaluate LLMs like a pro

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

↗ github · ★ 37k·src: wshobson/agents

the manual

LLM Evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

When to Use This Skill

  • Measuring LLM application performance systematically
  • Comparing different models or prompts
  • Detecting performance regressions before deployment
  • Validating improvements from prompt changes
  • Building confidence in production systems
  • Establishing baselines and tracking progress over time
  • Debugging unexpected model behavior

Core Evaluation Types

1. Automated Metrics

Fast, repeatable, scalable evaluation using computed scores.

Text Generation:

  • BLEU: N-gram overlap (translation)
  • ROUGE: Recall-oriented (summarization)
  • METEOR: Semantic similarity
  • BERTScore: Embedding-based similarity
  • Perplexity: Language model confidence

Classification:

  • Accuracy: Percentage correct
  • Precision/Recall/F1: Class-specific performance
  • Confusion Matrix: Error patterns
  • AUC-ROC: Ranking quality

Retrieval (RAG):

  • MRR: Mean Reciprocal Rank
  • NDCG: Normalized Discounted Cumulative Gain
  • Precision@K: Relevant in top K
  • Recall@K: Coverage in top K

2. Human Evaluation

Manual assessment for quality aspects difficult to automate.

Dimensions:

  • Accuracy: Factual correctness
  • Coherence: Logical flow
  • Relevance: Answers the question
  • Fluency: Natural language quality
  • Safety: No harmful content
  • Helpfulness: Useful to the user

3. LLM-as-Judge

Use stronger LLMs to evaluate weaker model outputs.

Approaches:

  • Pointwise: Score individual responses
  • Pairwise: Compare two responses
  • Reference-based: Compare to gold standard
  • Reference-free: Judge without ground truth

Quick Start

from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class Metric:
    name: str
    fn: Callable

    @staticmethod
    def accuracy():
        return Metric("accuracy", calculate_accuracy)

    @staticmethod
    def bleu():
        return Metric("bleu", calculate_bleu)

    @staticmethod
    def bertscore():
        return Metric("bertscore", calculate_bertscore)

    @staticmethod
    def custom(name: str, fn: Callable):
        return Metric(name, fn)

class EvaluationSuite:
    def __init__(self, metrics: list[Metric]):
        self.metrics = metrics

    async def evaluate(self, model, test_cases: list[dict]) -> dict:
        results = {m.name: [] for m in self.metrics}

        for test in test_cases:
            prediction = await model.predict(test["input"])

            for metric in self.metrics:
                score = metric.fn(
                    prediction=prediction,
                    reference=test.get("expected"),
                    context=test.get("context")
                )
                results[metric.name].append(score)

        return {
            "metrics": {k: np.mean(v) for k, v in results.items()},
            "raw_scores": results
        }

# Usage
suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom("groundedness", check_groundedness)
])

test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital."
    },
]

results = await suite.evaluate(model=your_model, test_cases=test_cases)

Detailed patterns and worked examples

Detailed pattern documentation lives in references/details.md. Read that file when the navigation tier above is insufficient.

more coding

Request code reviews to catch issues early
Coding
HOT
Request code reviews to catch issues early
requesting-code-review
2@ 2 240k
Execute plans flawlessly and efficiently
Coding
HOT
Execute plans flawlessly and efficiently
executing-plans
0@ 0 240k
Finish your dev branch like a pro
Coding
HOT
Finish your dev branch like a pro
finishing-a-development-branch
0@ 0 240k
Verify feedback before you implement changes
Coding
HOT
Verify feedback before you implement changes
receiving-code-review
0@ 0 240k
Debug systematically to save time
Coding
HOT
Debug systematically to save time
systematic-debugging
0@ 0 240k
Write tests first, code with confidence
Coding
HOT
Write tests first, code with confidence
test-driven-development
0@ 0 240k
Build powerful MCP servers fast
Coding
HOT
Build powerful MCP servers fast
mcp-builder
0@ 1 156k
Transform messy data into clean spreadsheets
Coding
HOT
Transform messy data into clean spreadsheets
xlsx
0@ 0 156k