
A/B Testing

This example shows how to use A/B testing to compare different evaluation strategies and find the best configuration for your use case.

  1. Creating Multiple Evaluation Sets: Define different evaluation strategies

  2. Re-evaluating with Different Configs: Evaluate the same responses with multiple sets

  3. Comparing Results: Analyze which configuration performs better

  4. Deploying the Winner: Use the best configuration as default

Step 1: Create Evaluation Sets

Create two evaluation sets with different configurations.

  1. Navigate to /admin/content/ai-autoevals/sets

  2. Click “Add Evaluation Set”

  3. Configure:

    • Label: “Lenient Evaluation”
    • Description: “More forgiving evaluation for general content”
    • Fact Extraction Method: “AI Generated”
    • Choice Scores:
      • A (Exact Match): 1.0
      • B (Superset): 0.8
      • C (Subset): 0.6
      • D (Disagreement): 0.2
  4. Save

Then create the second set:

  1. Click “Add Evaluation Set”

  2. Configure:

    • Label: “Strict Evaluation”
    • Description: “Strict evaluation for high-stakes content”
    • Fact Extraction Method: “AI Generated”
    • Choice Scores:
      • A (Exact Match): 1.0
      • B (Superset): 0.6
      • C (Subset): 0.4
      • D (Disagreement): 0.0
  3. Save
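You can also create both sets programmatically. A minimal sketch, assuming the EvaluationSet config entity accepts description and choice_scores properties matching the UI fields above (the entity namespace and the choice_scores key are assumptions):

<?php

// Adjust the namespace to the module's actual entity class.
use Drupal\ai_autoevals\Entity\EvaluationSet;

// Sketch only: 'description' and 'choice_scores' are assumed keys based on
// the UI fields; check the module's entity schema for the actual names.
$lenient = EvaluationSet::create([
  'id' => 'lenient_evaluation',
  'label' => 'Lenient Evaluation',
  'description' => 'More forgiving evaluation for general content',
  'fact_extraction_method' => 'ai_generated',
  'choice_scores' => ['A' => 1.0, 'B' => 0.8, 'C' => 0.6, 'D' => 0.2],
]);
$lenient->save();

$strict = EvaluationSet::create([
  'id' => 'strict_evaluation',
  'label' => 'Strict Evaluation',
  'description' => 'Strict evaluation for high-stakes content',
  'fact_extraction_method' => 'ai_generated',
  'choice_scores' => ['A' => 1.0, 'B' => 0.6, 'C' => 0.4, 'D' => 0.0],
]);
$strict->save();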

Step 2: Select Sample Evaluations

Choose a representative sample of existing evaluations to test:

# Use Drush to get a sample of evaluation IDs
drush entity:query ai_autoevals_evaluation_result \
--status=completed \
--limit=100 \
--field=id

Or select manually from /admin/content/ai-autoevals/results.
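If you prefer a random rather than recent sample, one option is the database layer's orderRandom() method. A sketch, assuming the entity's base table is ai_autoevals_evaluation_result (as in the queries later in this guide):

<?php
// Sketch: random sample of 100 completed evaluation result IDs.
$query = \Drupal::database()->select('ai_autoevals_evaluation_result', 'e');
$query->fields('e', ['id']);
$query->condition('e.status', 'completed');
$query->orderRandom();
$query->range(0, 100);
$evaluationIds = $query->execute()->fetchCol();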

Step 3: Re-evaluate with Different Configurations

Via the results listing:
  1. Navigate to /admin/content/ai-autoevals/results

  2. Filter to show completed evaluations

  3. Select the evaluations you want to re-evaluate

  4. Choose “Re-evaluate” action

  5. Select “Lenient Evaluation” as the new configuration

  6. Repeat with “Strict Evaluation”

Or via batch processing:

  1. Navigate to /admin/content/ai-autoevals/batch

  2. Configure batch re-evaluation:

    • Filter: Status = completed
    • New Evaluation Set: “Lenient Evaluation”
    • Limit: 100 evaluations
  3. Execute batch

  4. Repeat with “Strict Evaluation”

Or programmatically:

<?php
$batchProcessor = \Drupal::service('ai_autoevals.batch_processor');

// Get a sample of evaluation IDs.
$evaluationIds = \Drupal::entityQuery('ai_autoevals_evaluation_result')
  ->accessCheck(FALSE)
  ->condition('status', 'completed')
  ->range(0, 100)
  ->execute();

// Re-evaluate with the lenient configuration.
$lenientResults = $batchProcessor->reEvaluateBatch(
  $evaluationIds,
  'lenient_evaluation'
);

// Re-evaluate with the strict configuration.
$strictResults = $batchProcessor->reEvaluateBatch(
  $evaluationIds,
  'strict_evaluation'
);

\Drupal::messenger()->addMessage(
  t('Created @lenient lenient and @strict strict evaluations.', [
    '@lenient' => count($lenientResults),
    '@strict' => count($strictResults),
  ])
);
Step 4: Compare Results

  1. Navigate to /admin/content/ai-autoevals

  2. View statistics by evaluation set

  3. Compare average scores between “Lenient” and “Strict”

Or programmatically:

<?php
$batchProcessor = \Drupal::service('ai_autoevals.batch_processor');

// Compare configurations across the same sample.
$comparison = $batchProcessor->compareConfigurations(
  ['lenient_evaluation', 'strict_evaluation'],
  $evaluationIds
);

foreach ($comparison as $configId => $data) {
  \Drupal::messenger()->addMessage(
    t('@config: Average Score: @score, Count: @count', [
      '@config' => $configId,
      '@score' => number_format($data['average_score'], 2),
      '@count' => $data['count'],
    ])
  );
}
You can also query the averages directly:

<?php
// Get the average score and result count for each configuration.
foreach (['lenient_evaluation', 'strict_evaluation'] as $setId) {
  $query = \Drupal::database()->select('ai_autoevals_evaluation_result', 'e');
  $query->addExpression('AVG(e.score)', 'avg_score');
  $query->addExpression('COUNT(e.id)', 'count');
  $query->condition('e.evaluation_set_id', $setId);
  $query->condition('e.status', 'completed');
  $result = $query->execute()->fetchAssoc();

  \Drupal::messenger()->addMessage(
    t('@set: Average: @avg, Count: @count', [
      '@set' => $setId,
      '@avg' => number_format($result['avg_score'], 2),
      '@count' => $result['count'],
    ])
  );
}

Check whether the difference is statistically significant:

<?php

/**
 * Checks whether the difference between two average scores is significant.
 *
 * Uses a two-proportion z-test, which assumes scores are proportions in
 * the range [0, 1].
 */
function isDifferenceSignificant(float $score1, float $score2, int $n1, int $n2): bool {
  // Standard error of the difference between the two proportions.
  $se = sqrt(($score1 * (1 - $score1) / $n1) + ($score2 * (1 - $score2) / $n2));
  if ($se == 0.0) {
    // Degenerate case (e.g. all scores 0 or all 1): no detectable difference.
    return FALSE;
  }
  // z-score of the observed difference.
  $z = ($score1 - $score2) / $se;
  // Significant at the 95% confidence level when |z| > 1.96.
  return abs($z) > 1.96;
}

// Usage.
$lenientAvg = 0.75;
$strictAvg = 0.68;
$lenientCount = 100;
$strictCount = 100;

if (isDifferenceSignificant($lenientAvg, $strictAvg, $lenientCount, $strictCount)) {
  \Drupal::messenger()->addMessage('Difference is statistically significant');
}
else {
  \Drupal::messenger()->addMessage('Difference is not statistically significant');
}
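To run the check against real data rather than hard-coded averages, feed in the values queried in the comparison step. A short sketch, reusing the table and columns assumed above:

<?php
// Sketch: pull average score and count per evaluation set, then test.
$stats = [];
foreach (['lenient_evaluation', 'strict_evaluation'] as $setId) {
  $query = \Drupal::database()->select('ai_autoevals_evaluation_result', 'e');
  $query->addExpression('AVG(e.score)', 'avg_score');
  $query->addExpression('COUNT(e.id)', 'count');
  $query->condition('e.evaluation_set_id', $setId);
  $query->condition('e.status', 'completed');
  $stats[$setId] = $query->execute()->fetchAssoc();
}

$significant = isDifferenceSignificant(
  (float) $stats['lenient_evaluation']['avg_score'],
  (float) $stats['strict_evaluation']['avg_score'],
  (int) $stats['lenient_evaluation']['count'],
  (int) $stats['strict_evaluation']['count']
);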

Compare score distributions:

<?php

/**
 * Gets the score distribution for an evaluation set, keyed by choice.
 */
function getScoreDistribution(string $evaluationSetId): array {
  $query = \Drupal::database()->select('ai_autoevals_evaluation_result', 'e');
  $query->addField('e', 'choice');
  $query->addExpression('COUNT(e.id)', 'count');
  $query->condition('e.evaluation_set_id', $evaluationSetId);
  $query->condition('e.status', 'completed');
  $query->groupBy('e.choice');
  // Returns an array of choice => count.
  return $query->execute()->fetchAllKeyed();
}

// Compare distributions.
$lenientDist = getScoreDistribution('lenient_evaluation');
$strictDist = getScoreDistribution('strict_evaluation');

\Drupal::messenger()->addMessage('<pre>' . print_r([
  'lenient' => $lenientDist,
  'strict' => $strictDist,
], TRUE) . '</pre>');
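Because the two sets share the same choice categories and differ only in the weights assigned to B (Superset), C (Subset), and D (Disagreement), comparing the counts per choice shows whether a score gap comes from the weighting itself or from genuinely different evaluation outcomes.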

Step 5: Deploy the Winner

After analyzing results, deploy the best configuration:

  1. Update Default Settings: Go to /admin/config/ai/autoevals and set the winning evaluation set as the default (see the sketch after this list)

  2. Create New Evaluation Set: Optionally create a new set combining best practices from both

  3. Monitor Performance: Continue monitoring to ensure the chosen configuration performs well
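If you manage configuration in code, the default can also be set programmatically. A minimal sketch, assuming the module stores its default in an ai_autoevals.settings config object under a default_evaluation_set key (both names are assumptions, not confirmed API):

<?php
// Sketch only: config object name and key are assumptions; check the
// module's config schema for the actual names.
\Drupal::configFactory()
  ->getEditable('ai_autoevals.settings')
  ->set('default_evaluation_set', 'lenient_evaluation')
  ->save();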

Example: Testing Different Fact Extraction Methods

The same approach works for comparing fact extraction methods:

<?php

// Adjust the namespace to the module's actual entity class.
use Drupal\ai_autoevals\Entity\EvaluationSet;

// Create evaluation sets with different fact extraction methods and
// re-evaluate the same sample with each. Assumes a class context (e.g. a
// controller) that provides the getAverageScoreForSet() helper below, plus
// the $batchProcessor service and $evaluationIds sample from earlier steps.
$methods = ['ai_generated', 'rule_based', 'hybrid'];
$results = [];

foreach ($methods as $method) {
  // Create an evaluation set using this method.
  $evaluationSet = EvaluationSet::create([
    'id' => 'test_' . $method,
    'label' => 'Test: ' . ucfirst($method),
    'fact_extraction_method' => $method,
  ]);
  $evaluationSet->save();

  // Re-evaluate the sample against the new set.
  $newIds = $batchProcessor->reEvaluateBatch(
    $evaluationIds,
    'test_' . $method
  );

  // Record the average score for this method.
  $results[$method] = $this->getAverageScoreForSet('test_' . $method);
}

// Display results, best first.
arsort($results);
foreach ($results as $method => $avg) {
  \Drupal::messenger()->addMessage(
    t('@method: @avg', [
      '@method' => ucfirst($method),
      '@avg' => number_format($avg, 2),
    ])
  );
}

protected function getAverageScoreForSet(string $setId): float {
  $query = \Drupal::database()->select('ai_autoevals_evaluation_result', 'e');
  $query->addExpression('AVG(e.score)', 'avg');
  $query->condition('e.evaluation_set_id', $setId);
  $query->condition('e.status', 'completed');
  return (float) $query->execute()->fetchField();
}
Best Practices

  1. Use Representative Samples

    Ensure your sample represents the full range of your content:

    // Get diverse sample
    $evaluationIds = \Drupal::entityQuery('ai_autoevals_evaluation_result')
      ->accessCheck(FALSE)
      ->condition('status', 'completed')
      ->sort('created', 'DESC') // Get recent content
      ->range(0, 100) // Or use random sampling
      ->execute();
  2. Sufficient Sample Size

    Use a large enough sample for statistical significance:

    // Minimum of 100 evaluations per configuration
    $minimumSampleSize = 100;
    if (count($evaluationIds) < $minimumSampleSize) {
      \Drupal::messenger()->addWarning(
        t('Sample size is less than recommended minimum of @count.', [
          '@count' => $minimumSampleSize,
        ])
      );
    }
  3. Test Multiple Metrics

    Compare multiple metrics, not just average score:

    $metrics = [
      'average_score' => $this->getAverageScore($setId),
      'completion_rate' => $this->getCompletionRate($setId),
      'score_distribution' => $this->getScoreDistribution($setId),
    ];
  4. Document Results

    Keep records of your tests for future reference:

    $testLog = \Drupal::logger('ai_autoevals_ab_testing');
    $testLog->info('A/B Test Results', [
      'lenient_avg' => $lenientAvg,
      'strict_avg' => $strictAvg,
      'sample_size' => count($evaluationIds),
      'date' => date('Y-m-d H:i:s'),
    ]);
  5. Iterate and Refine

    A/B testing is an iterative process:

    • Test initial configurations
    • Analyze results
    • Refine configurations based on findings
    • Test again
    • Repeat until you find the optimal configuration