
A/B Testing

This example shows how to use A/B testing to compare different evaluation strategies and find the best configuration for your use case.

  1. Creating Multiple Evaluation Sets: Define different evaluation strategies

  2. Re-evaluating with Different Configs: Evaluate the same responses with multiple sets

  3. Comparing Results: Analyze which configuration performs better

  4. Deploying the Winner: Use the best configuration as default

Step 1: Create Evaluation Sets

Create two evaluation sets with different configurations.

  1. Navigate to /admin/content/ai-autoevals/sets

  2. Click “Add Evaluation Set”

  3. Configure:

    • Label: “Lenient Evaluation”
    • Description: “More forgiving evaluation for general content”
    • Fact Extraction Method: “AI Generated”
    • Choice Scores:
      • A (Exact Match): 1.0
      • B (Superset): 0.8
      • C (Subset): 0.6
      • D (Disagreement): 0.2
  4. Save

Then create the second set:

  1. Click “Add Evaluation Set”

  2. Configure:

    • Label: “Strict Evaluation”
    • Description: “Strict evaluation for high-stakes content”
    • Fact Extraction Method: “AI Generated”
    • Choice Scores:
      • A (Exact Match): 1.0
      • B (Superset): 0.6
      • C (Subset): 0.4
      • D (Disagreement): 0.0
  3. Save
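You can also create both sets programmatically. A minimal sketch, assuming the EvaluationSet config entity accepts description and choice_scores properties matching the UI fields above (the entity namespace and the choice_scores key are assumptions):

<?php

// Adjust the namespace to the module's actual entity class.
use Drupal\ai_autoevals\Entity\EvaluationSet;

// Sketch only: 'description' and 'choice_scores' are assumed keys based on
// the UI fields; check the module's entity schema for the actual names.
$lenient = EvaluationSet::create([
  'id' => 'lenient_evaluation',
  'label' => 'Lenient Evaluation',
  'description' => 'More forgiving evaluation for general content',
  'fact_extraction_method' => 'ai_generated',
  'choice_scores' => ['A' => 1.0, 'B' => 0.8, 'C' => 0.6, 'D' => 0.2],
]);
$lenient->save();

$strict = EvaluationSet::create([
  'id' => 'strict_evaluation',
  'label' => 'Strict Evaluation',
  'description' => 'Strict evaluation for high-stakes content',
  'fact_extraction_method' => 'ai_generated',
  'choice_scores' => ['A' => 1.0, 'B' => 0.6, 'C' => 0.4, 'D' => 0.0],
]);
$strict->save();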

Step 2: Select Sample Evaluations

Choose a representative sample of existing evaluations to test:

# Use Drush to get a sample of evaluation IDs
drush entity:query ai_autoevals_evaluation_result \
--status=completed \
--limit=100 \
--field=id

Or select manually from /admin/content/ai-autoevals/results.
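If you prefer a random rather than recent sample, one option is the database layer's orderRandom() method. A sketch, assuming the entity's base table is ai_autoevals_evaluation_result (as in the queries later in this guide):

<?php
// Sketch: random sample of 100 completed evaluation result IDs.
$query = \Drupal::database()->select('ai_autoevals_evaluation_result', 'e');
$query->fields('e', ['id']);
$query->condition('e.status', 'completed');
$query->orderRandom();
$query->range(0, 100);
$evaluationIds = $query->execute()->fetchCol();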

Step 3: Re-evaluate with Different Configurations

Via the results listing:
  1. Navigate to /admin/content/ai-autoevals/results

  2. Filter to show completed evaluations

  3. Select the evaluations you want to re-evaluate

  4. Choose “Re-evaluate” action

  5. Select “Lenient Evaluation” as the new configuration

  6. Repeat with “Strict Evaluation”

Or via batch processing:

  1. Navigate to /admin/content/ai-autoevals/batch

  2. Configure batch re-evaluation:

    • Filter: Status = completed
    • New Evaluation Set: “Lenient Evaluation”
    • Limit: 100 evaluations
  3. Execute batch

  4. Repeat with “Strict Evaluation”

Or programmatically:

<?php
$batchProcessor = \Drupal::service('ai_autoevals.batch_processor');

// Get a sample of evaluation IDs.
$evaluationIds = \Drupal::entityQuery('ai_autoevals_evaluation_result')
  ->accessCheck(FALSE)
  ->condition('status', 'completed')
  ->range(0, 100)
  ->execute();

// Re-evaluate with the lenient configuration.
$lenientResults = $batchProcessor->reEvaluateBatch(
  $evaluationIds,
  'lenient_evaluation'
);

// Re-evaluate with the strict configuration.
$strictResults = $batchProcessor->reEvaluateBatch(
  $evaluationIds,
  'strict_evaluation'
);

\Drupal::messenger()->addMessage(
  t('Created @lenient lenient and @strict strict evaluations.', [
    '@lenient' => count($lenientResults),
    '@strict' => count($strictResults),
  ])
);
Step 4: Compare Results

  1. Navigate to /admin/content/ai-autoevals

  2. View statistics by evaluation set

  3. Compare average scores between “Lenient” and “Strict”

Or programmatically:

<?php
$batchProcessor = \Drupal::service('ai_autoevals.batch_processor');

// Compare configurations across the same sample.
$comparison = $batchProcessor->compareConfigurations(
  ['lenient_evaluation', 'strict_evaluation'],
  $evaluationIds
);

foreach ($comparison as $configId => $data) {
  \Drupal::messenger()->addMessage(
    t('@config: Average Score: @score, Count: @count', [
      '@config' => $configId,
      '@score' => number_format($data['average_score'], 2),
      '@count' => $data['count'],
    ])
  );
}
You can also query the averages directly:

<?php
// Get the average score and result count for each configuration.
foreach (['lenient_evaluation', 'strict_evaluation'] as $setId) {
  $query = \Drupal::database()->select('ai_autoevals_evaluation_result', 'e');
  $query->addExpression('AVG(e.score)', 'avg_score');
  $query->addExpression('COUNT(e.id)', 'count');
  $query->condition('e.evaluation_set_id', $setId);
  $query->condition('e.status', 'completed');
  $result = $query->execute()->fetchAssoc();

  \Drupal::messenger()->addMessage(
    t('@set: Average: @avg, Count: @count', [
      '@set' => $setId,
      '@avg' => number_format($result['avg_score'], 2),
      '@count' => $result['count'],
    ])
  );
}

Check whether the difference is statistically significant:

<?php

/**
 * Checks whether the difference between two average scores is significant.
 *
 * Uses a two-proportion z-test, which assumes scores are proportions in
 * the range [0, 1].
 */
function isDifferenceSignificant(float $score1, float $score2, int $n1, int $n2): bool {
  // Standard error of the difference between the two proportions.
  $se = sqrt(($score1 * (1 - $score1) / $n1) + ($score2 * (1 - $score2) / $n2));
  if ($se == 0.0) {
    // Degenerate case (e.g. all scores 0 or all 1): no detectable difference.
    return FALSE;
  }
  // z-score of the observed difference.
  $z = ($score1 - $score2) / $se;
  // Significant at the 95% confidence level when |z| > 1.96.
  return abs($z) > 1.96;
}

// Usage.
$lenientAvg = 0.75;
$strictAvg = 0.68;
$lenientCount = 100;
$strictCount = 100;

if (isDifferenceSignificant($lenientAvg, $strictAvg, $lenientCount, $strictCount)) {
  \Drupal::messenger()->addMessage('Difference is statistically significant');
}
else {
  \Drupal::messenger()->addMessage('Difference is not statistically significant');
}
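To run the check against real data rather than hard-coded averages, feed in the values queried in the comparison step. A short sketch, reusing the table and columns assumed above:

<?php
// Sketch: pull average score and count per evaluation set, then test.
$stats = [];
foreach (['lenient_evaluation', 'strict_evaluation'] as $setId) {
  $query = \Drupal::database()->select('ai_autoevals_evaluation_result', 'e');
  $query->addExpression('AVG(e.score)', 'avg_score');
  $query->addExpression('COUNT(e.id)', 'count');
  $query->condition('e.evaluation_set_id', $setId);
  $query->condition('e.status', 'completed');
  $stats[$setId] = $query->execute()->fetchAssoc();
}

$significant = isDifferenceSignificant(
  (float) $stats['lenient_evaluation']['avg_score'],
  (float) $stats['strict_evaluation']['avg_score'],
  (int) $stats['lenient_evaluation']['count'],
  (int) $stats['strict_evaluation']['count']
);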

Compare score distributions:

<?php

/**
 * Gets the score distribution for an evaluation set, keyed by choice.
 */
function getScoreDistribution(string $evaluationSetId): array {
  $query = \Drupal::database()->select('ai_autoevals_evaluation_result', 'e');
  $query->addField('e', 'choice');
  $query->addExpression('COUNT(e.id)', 'count');
  $query->condition('e.evaluation_set_id', $evaluationSetId);
  $query->condition('e.status', 'completed');
  $query->groupBy('e.choice');
  // Returns an array of choice => count.
  return $query->execute()->fetchAllKeyed();
}

// Compare distributions.
$lenientDist = getScoreDistribution('lenient_evaluation');
$strictDist = getScoreDistribution('strict_evaluation');

\Drupal::messenger()->addMessage('<pre>' . print_r([
  'lenient' => $lenientDist,
  'strict' => $strictDist,
], TRUE) . '</pre>');
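Because the two sets share the same choice categories and differ only in the weights assigned to B (Superset), C (Subset), and D (Disagreement), comparing the counts per choice shows whether a score gap comes from the weighting itself or from genuinely different evaluation outcomes.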

Step 5: Deploy the Winner

After analyzing results, deploy the best configuration:

  1. Update Default Settings: Go to /admin/config/ai/autoevals and set the winning evaluation set as the default (see the sketch after this list)

  2. Create New Evaluation Set: Optionally create a new set combining best practices from both

  3. Monitor Performance: Continue monitoring to ensure the chosen configuration performs well
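If you manage configuration in code, the default can also be set programmatically. A minimal sketch, assuming the module stores its default in an ai_autoevals.settings config object under a default_evaluation_set key (both names are assumptions, not confirmed API):

<?php
// Sketch only: config object name and key are assumptions; check the
// module's config schema for the actual names.
\Drupal::configFactory()
  ->getEditable('ai_autoevals.settings')
  ->set('default_evaluation_set', 'lenient_evaluation')
  ->save();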

Example: Testing Different Fact Extraction Methods

The same approach works for comparing fact extraction methods:

<?php

// Adjust the namespace to the module's actual entity class.
use Drupal\ai_autoevals\Entity\EvaluationSet;

// Create evaluation sets with different fact extraction methods and
// re-evaluate the same sample with each. Assumes a class context (e.g. a
// controller) that provides the getAverageScoreForSet() helper below, plus
// the $batchProcessor service and $evaluationIds sample from earlier steps.
$methods = ['ai_generated', 'rule_based', 'hybrid'];
$results = [];

foreach ($methods as $method) {
  // Create an evaluation set using this method.
  $evaluationSet = EvaluationSet::create([
    'id' => 'test_' . $method,
    'label' => 'Test: ' . ucfirst($method),
    'fact_extraction_method' => $method,
  ]);
  $evaluationSet->save();

  // Re-evaluate the sample against the new set.
  $newIds = $batchProcessor->reEvaluateBatch(
    $evaluationIds,
    'test_' . $method
  );

  // Record the average score for this method.
  $results[$method] = $this->getAverageScoreForSet('test_' . $method);
}

// Display results, best first.
arsort($results);
foreach ($results as $method => $avg) {
  \Drupal::messenger()->addMessage(
    t('@method: @avg', [
      '@method' => ucfirst($method),
      '@avg' => number_format($avg, 2),
    ])
  );
}

protected function getAverageScoreForSet(string $setId): float {
  $query = \Drupal::database()->select('ai_autoevals_evaluation_result', 'e');
  $query->addExpression('AVG(e.score)', 'avg');
  $query->condition('e.evaluation_set_id', $setId);
  $query->condition('e.status', 'completed');
  return (float) $query->execute()->fetchField();
}
Best Practices

  1. Use Representative Samples

    Ensure your sample represents the full range of your content:

    // Get diverse sample
    $evaluationIds = \Drupal::entityQuery('ai_autoevals_evaluation_result')
      ->accessCheck(FALSE)
      ->condition('status', 'completed')
      ->sort('created', 'DESC') // Get recent content
      ->range(0, 100) // Or use random sampling
      ->execute();
  2. Sufficient Sample Size

    Use a large enough sample for statistical significance:

    // Minimum of 100 evaluations per configuration
    $minimumSampleSize = 100;
    if (count($evaluationIds) < $minimumSampleSize) {
      \Drupal::messenger()->addWarning(
        t('Sample size is less than recommended minimum of @count.', [
          '@count' => $minimumSampleSize,
        ])
      );
    }
  3. Test Multiple Metrics

    Compare multiple metrics, not just average score:

    $metrics = [
      'average_score' => $this->getAverageScore($setId),
      'completion_rate' => $this->getCompletionRate($setId),
      'score_distribution' => $this->getScoreDistribution($setId),
    ];
  4. Document Results

    Keep records of your tests for future reference:

    $testLog = \Drupal::logger('ai_autoevals_ab_testing');
    $testLog->info('A/B Test Results', [
      'lenient_avg' => $lenientAvg,
      'strict_avg' => $strictAvg,
      'sample_size' => count($evaluationIds),
      'date' => date('Y-m-d H:i:s'),
    ]);
  5. Iterate and Refine

    A/B testing is an iterative process:

    • Test initial configurations
    • Analyze results
    • Refine configurations based on findings
    • Test again
    • Repeat until you find the optimal configuration