Getting Started
AI AutoEvals provides automated factuality evaluation of AI responses. This guide will help you get up and running quickly.
Installation
1. Install the module using Composer and enable it:

   ```shell
   composer require drupal/ai_autoevals
   drush en ai_autoevals
   ```

2. Configure the module at /admin/config/ai/autoevals
3. Visit the dashboard at /admin/content/ai-autoevals
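After installing, you can confirm the module is enabled from the command line. This is a quick check using the standard Drush `pm:list` command (it requires a working Drupal site, so adjust to your environment):

```shell
# List the module and verify it shows as enabled.
drush pm:list --filter=ai_autoevals --status=enabled
```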
Basic Configuration
1. Configure Default AI Provider: Set the provider and model to use for evaluations at /admin/config/ai/autoevals:
   - Default Provider: Select your configured AI provider (e.g., OpenAI, Anthropic)
   - Default Model: Choose the model for evaluations (e.g., GPT-4, Claude 3)
2. Enable Auto-Tracking: Check “Auto-track requests” to automatically evaluate all AI responses that match the configured operation types.
3. Configure Evaluation Settings:
   - Operation Types: Which operations to evaluate (chat, chat_completion)
   - Fact Extraction Method: Choose AI-generated, rule-based, or hybrid
   - Context Depth: Number of conversation turns to include
   - Retention Period: How long to keep evaluation results
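If you prefer the command line, the same settings can be inspected with Drush. Note that the config object name `ai_autoevals.settings` is an assumption based on common Drupal module conventions; check the module's config schema for the actual name:

```shell
# Dump the module's configuration object (name assumed, not confirmed by this guide).
drush config:get ai_autoevals.settings
```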
Processing Evaluations
Evaluations are processed asynchronously via the Drupal Queue API. You can process them in two ways:
Automatic Processing
Let cron process evaluations automatically (60-second time limit per cron run).
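You don't have to wait for the site's scheduled cron: the standard Drush `cron` command triggers a cron run on demand, which will process queued evaluations within the time limit described above:

```shell
# Run Drupal cron immediately; queued evaluations are processed for up to 60 seconds.
drush cron
```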
Manual Processing
Process the queue manually:

```shell
drush queue:run ai_autoevals_evaluation_worker
```

View Results
Check the dashboard at /admin/content/ai-autoevals to see:
- Total evaluations
- Average score
- Evaluations by status
- Evaluations by evaluation set
- Recent evaluations
- Score distribution
Understanding Scores
Evaluations return scores from 0.0 to 1.0 based on factual accuracy:
| Score | Meaning | Description |
|---|---|---|
| 1.0 | Exact Match | Response fully meets expected criteria |
| 0.6 | Superset | Response includes all expected info plus more |
| 0.4 | Subset | Response contains some expected info but is missing other parts |
| 0.0 | Disagreement | Response contradicts expected facts |
| 1.0 | Irrelevant | Differences don’t affect factuality |
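The table above can be sketched as a small helper that maps a verdict to its score. The verdict names (`exact`, `superset`, `subset`, `disagreement`, `irrelevant`) are illustrative labels for this sketch, not identifiers exposed by the module:

```shell
#!/bin/sh
# Sketch: map a factuality verdict to its score, mirroring the table above.
# Verdict names are illustrative, not module identifiers.
score_for() {
  case "$1" in
    exact|irrelevant)  echo "1.0" ;;  # exact match, or differences don't affect factuality
    superset)          echo "0.6" ;;  # all expected info plus extra
    subset)            echo "0.4" ;;  # some expected info present, some missing
    disagreement)      echo "0.0" ;;  # contradicts expected facts
    *) echo "unknown verdict: $1" >&2; return 1 ;;
  esac
}

score_for superset   # prints 0.6
```

Note that both `exact` and `irrelevant` score 1.0: a response whose only differences from the expected answer don't affect factuality is treated as fully factual.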