Techniques & Methods
Evaluation Metrics
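Evaluation metrics provide quantitative, repeatable measures of model quality. Common NLP metrics include BLEU (n-gram overlap with reference translations), ROUGE (n-gram overlap with reference summaries), F1 score (the harmonic mean of precision and recall, used for classification), perplexity (how well a language model predicts held-out text; lower is better), and BERTScore (semantic similarity computed from contextual embeddings). Benchmarks assess broader LLM capabilities: MMLU covers knowledge across many subjects, HellaSwag covers commonsense reasoning, and HumanEval covers code generation.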
No single metric captures all aspects of quality. Human evaluation remains the gold standard for open-ended generation. Careful metric selection and interpretation are critical to avoid optimizing for metrics that don't reflect real-world performance.
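As a rough illustration of two of the metrics named above, the sketch below computes F1 for a binary classifier and perplexity from per-token log-probabilities. The function names, the `positive` parameter, and the sample inputs are made up for this example; in practice an established evaluation library would normally be used instead.

```python
import math


def f1_score(y_true, y_pred, positive=1):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token.

    Expects natural-log probabilities, one per token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Illustrative inputs only (not real model output).
labels = [1, 0, 1, 1, 0]
predictions = [1, 1, 1, 0, 0]
print(f1_score(labels, predictions))

# A model that assigns probability 0.25 to every token.
print(perplexity([math.log(0.25)] * 4))
```

The two numbers answer different questions, one about classification errors and one about predictive uncertainty, which is why no single score is reported in isolation.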
Related Terms
Techniques & Methods
Response Quality
Evaluation of an AI response's relevance, coherence, accuracy, and helpfulness.
Techniques & Methods
Validation
Evaluating model performance on data held separate from the training set.
Core Concepts
Supervised Learning
Models trained on labeled data, learning to predict outcomes from inputs.
Techniques & Methods
Training
Teaching a model to make accurate predictions by exposing it to large datasets.

