
Basic Usage

This guide covers the basic usage of AI AutoEvals for automated factuality evaluation.

Settings Configuration

Visit /admin/config/ai/autoevals to configure global settings:

  • Default Evaluation Provider: AI provider and model to use for running evaluations
  • Auto-track: Automatically evaluate all matching AI requests
  • Operation Types: Which AI operation types to evaluate (chat, chat_completion)
  • Fact Extraction Method: Default method for extracting evaluation criteria (fallback setting, see below)
  • Context Depth: Number of conversation turns to include for context
  • Retention Period: How long to keep evaluation results
  • Debug Mode: Enable additional logging for troubleshooting

Important: The “Fact Extraction Method” global setting is used as a fallback. Individual evaluation sets can override this setting with their own configuration. This allows you to have different extraction strategies for different types of content or evaluation sets.

Auto-Tracking

The easiest way to use AI AutoEvals is with auto-tracking enabled. This automatically evaluates all AI responses that match your configured operation types.

  1. Go to /admin/config/ai/autoevals

  2. Check the “Auto-track requests” checkbox

  3. Save the configuration

All AI requests will now be automatically queued for evaluation.

Evaluations are processed asynchronously. Run the evaluation queue:

drush queue:run ai_autoevals_evaluation_worker

Or let cron process them automatically.
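If you schedule queue processing through system cron rather than Drupal's cron, a crontab entry can call Drush directly. A minimal sketch — the site path and schedule are assumptions for your environment:

```shell
# Hypothetical crontab entry: process the evaluation queue every five minutes,
# capping each run at 55 seconds so consecutive runs do not overlap.
*/5 * * * * cd /var/www/html && drush queue:run ai_autoevals_evaluation_worker --time-limit=55
```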

Selective Tracking

For more control, you can selectively track specific requests by adding a tag to your AI calls.

$ai_provider = \Drupal::service('ai.provider')->createInstance('amazeeio');
$response = $ai_provider->chat($input, $model, [
  'ai_autoevals:track' => TRUE,
]);

Only requests with this tag will be evaluated.
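If several call sites decide whether to track, it can help to centralize the tag-building logic. A minimal sketch — the helper name and approach are illustrative, not part of the module:

```php
<?php

/**
 * Hypothetical helper: builds the tag array for an AI call, adding the
 * AutoEvals tracking flag only when tracking is wanted.
 */
function build_autoevals_tags(bool $track, array $metadata = []): array {
  $tags = $metadata;
  if ($track) {
    $tags['ai_autoevals:track'] = TRUE;
  }
  return $tags;
}

// Usage: $response = $ai_provider->chat($input, $model, build_autoevals_tags(TRUE));
```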

You can also attach additional metadata tags alongside the tracking flag:

$response = $ai_provider->chat($input, $model, [
  'ai_autoevals:track' => TRUE,
  'category' => 'support',
  'priority' => 'high',
]);

This information is stored with the evaluation and can be used for filtering or routing.
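For example, once evaluations carry metadata, results can be grouped or filtered by it. A self-contained sketch of the idea — the record shape here is illustrative; in practice you would query the module's own storage:

```php
<?php

// Illustrative evaluation records carrying metadata tags.
$evaluations = [
  ['score' => 1.0, 'tags' => ['category' => 'support', 'priority' => 'high']],
  ['score' => 0.4, 'tags' => ['category' => 'billing', 'priority' => 'low']],
  ['score' => 0.6, 'tags' => ['category' => 'support', 'priority' => 'low']],
];

// Keep only evaluations whose metadata tag matches the given value.
function filter_by_tag(array $evaluations, string $key, string $value): array {
  return array_values(array_filter(
    $evaluations,
    fn (array $e): bool => ($e['tags'][$key] ?? NULL) === $value,
  ));
}

// $support now holds the two evaluations tagged 'support'.
$support = filter_by_tag($evaluations, 'category', 'support');
```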

Understanding Scores

Evaluations return scores from 0.0 to 1.0 based on factual accuracy:

Score | Meaning      | Description
1.0   | Exact Match  | Response fully meets the expected criteria
0.6   | Superset     | Response includes all expected information, plus more
0.4   | Subset       | Response includes some, but not all, of the expected information
0.0   | Disagreement | Response contradicts the expected facts

Scores are determined by comparing the AI response against evaluation criteria extracted from the user’s question.
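When reporting on results, the four score levels above can be mapped back to their labels. A minimal sketch — the thresholds mirror the table, but the function itself is not part of the module:

```php
<?php

// Maps a factuality score to the label used in the score table above.
function score_label(float $score): string {
  if ($score >= 1.0) {
    return 'Exact Match';
  }
  if ($score >= 0.6) {
    return 'Superset';
  }
  if ($score >= 0.4) {
    return 'Subset';
  }
  return 'Disagreement';
}
```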

Viewing Results

Visit /admin/content/ai-autoevals to see:

  • Total evaluations processed
  • Average score across all evaluations
  • Evaluations by status (pending, processing, completed, failed)
  • Evaluations by evaluation set
  • Recent evaluations
  • Score distribution chart

Click on any evaluation to see:

  • Original question and AI response
  • Extracted evaluation criteria
  • Score and analysis
  • Evaluation set used
  • Provider and model information
  • Timestamp and metadata

Use filters to find specific evaluations:

  • By status (pending, processing, completed, failed)
  • By evaluation set
  • By score range
  • By provider or model
  • By date range
  • By tags

Programmatic Usage

You can also create and manage evaluations programmatically:

$evaluationManager = \Drupal::service('ai_autoevals.evaluation_manager');

// Create an evaluation.
$evaluation = $evaluationManager->createEvaluation([
  'evaluation_set_id' => 'default',
  'request_id' => 'unique-request-id',
  'provider_id' => 'amazeeio',
  'model_id' => 'chat',
  'operation_type' => 'chat',
  'input' => 'What is the capital of France?',
  'output' => 'The capital of France is Paris.',
  'tags' => ['category' => 'geography'],
]);

// Queue it for processing.
$evaluationManager->queueEvaluation($evaluation->id());
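When payloads are built dynamically, it can be worth checking them before calling createEvaluation(). A hypothetical guard — the required keys are taken from the example above; the helper is not part of the module:

```php
<?php

// Hypothetical guard: verifies an evaluation payload contains the keys
// used in the createEvaluation() example above.
function assert_evaluation_payload(array $payload): void {
  $required = [
    'evaluation_set_id', 'request_id', 'provider_id', 'model_id',
    'operation_type', 'input', 'output',
  ];
  $missing = array_diff($required, array_keys($payload));
  if ($missing) {
    throw new \InvalidArgumentException('Missing keys: ' . implode(', ', $missing));
  }
}
```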
Best Practices

  1. Start with Auto-Tracking

    Begin with auto-tracking enabled to get a baseline of your AI’s performance.

  2. Monitor Scores Regularly

    Check the dashboard regularly to track performance trends and identify issues.

  3. Use Tags for Organization

    Add tags to categorize your evaluations for better filtering and analysis.

  4. Process Queue Regularly

    Ensure the evaluation queue is processed regularly to avoid backlog.

  5. Review Failed Evaluations

    Investigate failed evaluations to identify configuration issues or API problems.