
Basic Usage

This guide covers the basic usage of AI AutoEvals for automated factuality evaluation.

Settings Configuration

Visit /admin/config/ai/autoevals to configure global settings:

  • Default Evaluation Provider: AI provider and model to use for running evaluations
  • Auto-track: Automatically evaluate all matching AI requests
  • Operation Types: Which AI operation types to evaluate (chat, chat_completion)
  • Fact Extraction Method: Default method for extracting evaluation criteria (fallback setting, see below)
  • Context Depth: Number of conversation turns to include for context
  • Retention Period: How long to keep evaluation results
  • Debug Mode: Enable additional logging for troubleshooting

Important: The “Fact Extraction Method” global setting is used as a fallback. Individual evaluation sets can override this setting with their own configuration. This allows you to have different extraction strategies for different types of content or evaluation sets.

Auto-Tracking

The easiest way to use AI AutoEvals is with auto-tracking enabled. This automatically evaluates all AI responses that match your configured operation types.

  1. Go to /admin/config/ai/autoevals

  2. Check the “Auto-track requests” checkbox

  3. Save the configuration

All AI requests will now be automatically queued for evaluation.

Evaluations are processed asynchronously. Run the evaluation queue:

drush queue:run ai_autoevals_evaluation_worker

Or let cron process them automatically.
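If you schedule queue processing through system cron rather than Drupal's cron, a crontab entry can call Drush directly. A minimal sketch — the site path and schedule are assumptions for your environment:

```shell
# Hypothetical crontab entry: process the evaluation queue every five minutes,
# capping each run at 55 seconds so consecutive runs do not overlap.
*/5 * * * * cd /var/www/html && drush queue:run ai_autoevals_evaluation_worker --time-limit=55
```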

Selective Tracking

For more control, you can selectively track specific requests by adding a tag to your AI calls.

$ai_provider = \Drupal::service('ai.provider')->createInstance('amazeeio');
$response = $ai_provider->chat($input, $model, [
  'ai_autoevals:track' => TRUE,
]);

Only requests with this tag will be evaluated.
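If several call sites decide whether to track, it can help to centralize the tag-building logic. A minimal sketch — the helper name and approach are illustrative, not part of the module:

```php
<?php

/**
 * Hypothetical helper: builds the tag array for an AI call, adding the
 * AutoEvals tracking flag only when tracking is wanted.
 */
function build_autoevals_tags(bool $track, array $metadata = []): array {
  $tags = $metadata;
  if ($track) {
    $tags['ai_autoevals:track'] = TRUE;
  }
  return $tags;
}

// Usage: $response = $ai_provider->chat($input, $model, build_autoevals_tags(TRUE));
```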

You can also attach additional metadata tags alongside the tracking flag:

$response = $ai_provider->chat($input, $model, [
  'ai_autoevals:track' => TRUE,
  'category' => 'support',
  'priority' => 'high',
]);

This information is stored with the evaluation and can be used for filtering or routing.
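For example, once evaluations carry metadata, results can be grouped or filtered by it. A self-contained sketch of the idea — the record shape here is illustrative; in practice you would query the module's own storage:

```php
<?php

// Illustrative evaluation records carrying metadata tags.
$evaluations = [
  ['score' => 1.0, 'tags' => ['category' => 'support', 'priority' => 'high']],
  ['score' => 0.4, 'tags' => ['category' => 'billing', 'priority' => 'low']],
  ['score' => 0.6, 'tags' => ['category' => 'support', 'priority' => 'low']],
];

// Keep only evaluations whose metadata tag matches the given value.
function filter_by_tag(array $evaluations, string $key, string $value): array {
  return array_values(array_filter(
    $evaluations,
    fn (array $e): bool => ($e['tags'][$key] ?? NULL) === $value,
  ));
}

// $support now holds the two evaluations tagged 'support'.
$support = filter_by_tag($evaluations, 'category', 'support');
```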

Understanding Scores

Evaluations return scores from 0.0 to 1.0 based on factual accuracy:

Score | Meaning      | Description
1.0   | Exact Match  | Response fully meets the expected criteria
0.6   | Superset     | Response includes all expected information, plus more
0.4   | Subset       | Response includes some, but not all, of the expected information
0.0   | Disagreement | Response contradicts the expected facts

Scores are determined by comparing the AI response against evaluation criteria extracted from the user’s question.
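When reporting on results, the four score levels above can be mapped back to their labels. A minimal sketch — the thresholds mirror the table, but the function itself is not part of the module:

```php
<?php

// Maps a factuality score to the label used in the score table above.
function score_label(float $score): string {
  if ($score >= 1.0) {
    return 'Exact Match';
  }
  if ($score >= 0.6) {
    return 'Superset';
  }
  if ($score >= 0.4) {
    return 'Subset';
  }
  return 'Disagreement';
}
```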

Viewing Results

Visit /admin/content/ai-autoevals to see:

  • Total evaluations processed
  • Average score across all evaluations
  • Evaluations by status (pending, processing, completed, failed)
  • Evaluations by evaluation set
  • Recent evaluations
  • Score distribution chart

Click on any evaluation to see:

  • Original question and AI response
  • Extracted evaluation criteria
  • Score and analysis
  • Evaluation set used
  • Provider and model information
  • Timestamp and metadata

Use filters to find specific evaluations:

  • By status (pending, processing, completed, failed)
  • By evaluation set
  • By score range
  • By provider or model
  • By date range
  • By tags

Programmatic Usage

You can also create and manage evaluations programmatically:

$evaluationManager = \Drupal::service('ai_autoevals.evaluation_manager');

// Create an evaluation.
$evaluation = $evaluationManager->createEvaluation([
  'evaluation_set_id' => 'default',
  'request_id' => 'unique-request-id',
  'provider_id' => 'amazeeio',
  'model_id' => 'chat',
  'operation_type' => 'chat',
  'input' => 'What is the capital of France?',
  'output' => 'The capital of France is Paris.',
  'tags' => ['category' => 'geography'],
]);

// Queue it for processing.
$evaluationManager->queueEvaluation($evaluation->id());
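When payloads are built dynamically, it can be worth checking them before calling createEvaluation(). A hypothetical guard — the required keys are taken from the example above; the helper is not part of the module:

```php
<?php

// Hypothetical guard: verifies an evaluation payload contains the keys
// used in the createEvaluation() example above.
function assert_evaluation_payload(array $payload): void {
  $required = [
    'evaluation_set_id', 'request_id', 'provider_id', 'model_id',
    'operation_type', 'input', 'output',
  ];
  $missing = array_diff($required, array_keys($payload));
  if ($missing) {
    throw new \InvalidArgumentException('Missing keys: ' . implode(', ', $missing));
  }
}
```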
Best Practices

  1. Start with Auto-Tracking

    Begin with auto-tracking enabled to get a baseline of your AI’s performance.

  2. Monitor Scores Regularly

    Check the dashboard regularly to track performance trends and identify issues.

  3. Use Tags for Organization

    Add tags to categorize your evaluations for better filtering and analysis.

  4. Process Queue Regularly

    Ensure the evaluation queue is processed regularly to avoid backlog.

  5. Review Failed Evaluations

    Investigate failed evaluations to identify configuration issues or API problems.