# A/B Testing

This example shows how to use A/B testing to compare different evaluation strategies and find the best configuration for your use case.
## Overview

This example covers:

1. **Creating Multiple Evaluation Sets**: Define different evaluation strategies
2. **Re-evaluating with Different Configs**: Evaluate the same responses with multiple sets
3. **Comparing Results**: Analyze which configuration performs better
4. **Deploying the Winner**: Use the best configuration as the default
## Step 1: Create Evaluation Sets

Create two evaluation sets with different configurations.

### Set A: Lenient Evaluation

1. Navigate to `/admin/content/ai-autoevals/sets`
2. Click "Add Evaluation Set"
3. Configure:
   - Label: "Lenient Evaluation"
   - Description: "More forgiving evaluation for general content"
   - Fact Extraction Method: "AI Generated"
   - Choice Scores:
     - A (Exact Match): 1.0
     - B (Superset): 0.8
     - C (Subset): 0.6
     - D (Disagreement): 0.2
4. Save
### Set B: Strict Evaluation

1. Click "Add Evaluation Set"
2. Configure:
   - Label: "Strict Evaluation"
   - Description: "Strict evaluation for high-stakes content"
   - Fact Extraction Method: "AI Generated"
   - Choice Scores:
     - A (Exact Match): 1.0
     - B (Superset): 0.6
     - C (Subset): 0.4
     - D (Disagreement): 0.0
3. Save
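If the module stores evaluation sets as configuration entities, an export of Set A might look roughly like the sketch below. The config keys shown here are assumptions based on the form fields above, not the module's actual schema; export a real set (`drush config:export` or the config UI) to see the true structure.

```yaml
# Illustrative sketch only; real key names may differ.
id: lenient_evaluation
label: 'Lenient Evaluation'
description: 'More forgiving evaluation for general content'
fact_extraction_method: ai_generated
choice_scores:
  A: 1.0
  B: 0.8
  C: 0.6
  D: 0.2
```

Keeping both sets in exported config makes it easy to diff the two strategies and to redeploy the winner across environments.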
## Step 2: Select Sample Evaluations

Choose a representative sample of existing evaluations to test:

```bash
# Use Drush to get a sample of evaluation IDs
drush entity:query ai_autoevals_evaluation_result \
  --status=completed \
  --limit=100 \
  --field=id
```

Or select manually from `/admin/content/ai-autoevals/results`.
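If you want a random sample rather than the most recent results, you can shuffle the IDs in PHP after the entity query. This is a sketch to run via `drush php:script` or similar; entity queries have no portable random sort, so the shuffle happens in PHP:

```php
<?php

// Sketch: pick a random sample of 100 completed evaluation IDs.
$allIds = \Drupal::entityQuery('ai_autoevals_evaluation_result')
  ->accessCheck(FALSE)
  ->condition('status', 'completed')
  ->execute();

// Shuffle in PHP and keep the first 100.
$allIds = array_values($allIds);
shuffle($allIds);
$evaluationIds = array_slice($allIds, 0, 100);
```

Random sampling avoids biasing the test toward whatever content type happened to be evaluated most recently.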
## Step 3: Re-evaluate with Different Configurations

### Using the Dashboard

1. Navigate to `/admin/content/ai-autoevals/results`
2. Filter to show completed evaluations
3. Select the evaluations you want to re-evaluate
4. Choose the "Re-evaluate" action
5. Select "Lenient Evaluation" as the new configuration
6. Repeat with "Strict Evaluation"
### Using Batch Operations

1. Navigate to `/admin/content/ai-autoevals/batch`
2. Configure the batch re-evaluation:
   - Filter: Status = completed
   - New Evaluation Set: "Lenient Evaluation"
   - Limit: 100 evaluations
3. Execute the batch
4. Repeat with "Strict Evaluation"
### Programmatically

```php
<?php

$batchProcessor = \Drupal::service('ai_autoevals.batch_processor');

// Get a sample of evaluation IDs.
$evaluationIds = \Drupal::entityQuery('ai_autoevals_evaluation_result')
  ->accessCheck(FALSE)
  ->condition('status', 'completed')
  ->range(0, 100)
  ->execute();

// Re-evaluate with the lenient configuration.
$lenientResults = $batchProcessor->reEvaluateBatch(
  $evaluationIds,
  'lenient_evaluation'
);

// Re-evaluate with the strict configuration.
$strictResults = $batchProcessor->reEvaluateBatch(
  $evaluationIds,
  'strict_evaluation'
);

\Drupal::messenger()->addMessage(
  t('Created @lenient lenient and @strict strict evaluations.', [
    '@lenient' => count($lenientResults),
    '@strict' => count($strictResults),
  ])
);
```

## Step 4: Compare Results

### Using the Dashboard
1. Navigate to `/admin/content/ai-autoevals`
2. View statistics by evaluation set
3. Compare average scores between "Lenient" and "Strict"
### Using the Comparison API

```php
<?php

$batchProcessor = \Drupal::service('ai_autoevals.batch_processor');

// Compare configurations.
$comparison = $batchProcessor->compareConfigurations(
  ['lenient_evaluation', 'strict_evaluation'],
  $evaluationIds
);

foreach ($comparison as $configId => $data) {
  \Drupal::messenger()->addMessage(
    t('@config: Average Score: @score, Count: @count', [
      '@config' => $configId,
      '@score' => number_format($data['average_score'], 2),
      '@count' => $data['count'],
    ])
  );
}
```

### Manual Comparison Query
```php
<?php

// Get the average score for each configuration.
foreach (['lenient_evaluation', 'strict_evaluation'] as $setId) {
  $query = \Drupal::database()->select('ai_autoevals_evaluation_result', 'e');
  $query->addExpression('AVG(e.score)', 'avg_score');
  $query->addExpression('COUNT(e.id)', 'count');
  $query->condition('e.evaluation_set_id', $setId);
  $query->condition('e.status', 'completed');

  $result = $query->execute()->fetchAssoc();

  \Drupal::messenger()->addMessage(
    t('@set: Average: @avg, Count: @count', [
      '@set' => $setId,
      '@avg' => number_format($result['avg_score'], 2),
      '@count' => $result['count'],
    ])
  );
}
```

## Step 5: Analyze Results

### Statistical Significance

Check whether the difference between the two average scores is statistically significant:
```php
<?php

/**
 * Determines whether the difference between two scores is significant.
 *
 * Treats the average scores as proportions in [0, 1] and applies a
 * two-proportion z-test at the 95% confidence level.
 */
function isDifferenceSignificant(float $score1, float $score2, int $n1, int $n2): bool {
  // Standard error of the difference between the two proportions.
  $se = sqrt(($score1 * (1 - $score1) / $n1) + ($score2 * (1 - $score2) / $n2));
  if ($se == 0) {
    // Identical, degenerate scores: no measurable difference.
    return FALSE;
  }

  // z-score of the observed difference.
  $z = ($score1 - $score2) / $se;

  // Significant at the 95% confidence level when |z| > 1.96.
  return abs($z) > 1.96;
}

// Usage.
$lenientAvg = 0.75;
$strictAvg = 0.68;
$lenientCount = 100;
$strictCount = 100;

if (isDifferenceSignificant($lenientAvg, $strictAvg, $lenientCount, $strictCount)) {
  \Drupal::messenger()->addMessage('Difference is statistically significant');
}
else {
  \Drupal::messenger()->addMessage('Difference is not statistically significant');
}
```

### Score Distribution

Compare score distributions:
```php
<?php

/**
 * Gets the score distribution for an evaluation set.
 */
function getScoreDistribution(string $evaluationSetId): array {
  $query = \Drupal::database()->select('ai_autoevals_evaluation_result', 'e');
  $query->addField('e', 'choice');
  $query->addExpression('COUNT(e.id)', 'count');
  $query->condition('e.evaluation_set_id', $evaluationSetId);
  $query->condition('e.status', 'completed');
  $query->groupBy('e.choice');

  // Returns an array keyed by choice, with counts as values.
  return $query->execute()->fetchAllKeyed();
}

// Compare distributions.
$lenientDist = getScoreDistribution('lenient_evaluation');
$strictDist = getScoreDistribution('strict_evaluation');

\Drupal::messenger()->addMessage('<pre>' . print_r([
  'lenient' => $lenientDist,
  'strict' => $strictDist,
], TRUE) . '</pre>');
```

## Step 6: Deploy the Winner

After analyzing the results, deploy the best configuration:
1. **Update Default Settings**: Go to `/admin/config/ai/autoevals` and set the winning evaluation set as the default
2. **Create a New Evaluation Set**: Optionally create a new set combining the best practices from both
3. **Monitor Performance**: Continue monitoring to ensure the chosen configuration performs well
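If you prefer to deploy the winner from code (for example, in an update hook so the change ships with a release), something like the following sketch may work. The config object name and key (`ai_autoevals.settings`, `default_evaluation_set`) are assumptions for illustration; check the module's settings form for the real names before using this:

```php
<?php

// Sketch: set the winning evaluation set as the site-wide default.
// The config name and key below are assumed, not confirmed.
\Drupal::configFactory()
  ->getEditable('ai_autoevals.settings')
  ->set('default_evaluation_set', 'lenient_evaluation')
  ->save();
```

Deploying via config rather than the UI keeps the change reviewable and repeatable across environments.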
## Example: Testing Different Fact Extraction Methods

```php
<?php

// Adjust this use statement to the module's actual EvaluationSet class.
use Drupal\ai_autoevals\Entity\EvaluationSet;

// Create evaluation sets with different fact extraction methods.
$methods = ['ai_generated', 'rule_based', 'hybrid'];
$results = [];

foreach ($methods as $method) {
  // Create an evaluation set using this method.
  $evaluationSet = EvaluationSet::create([
    'id' => 'test_' . $method,
    'label' => 'Test: ' . ucfirst($method),
    'fact_extraction_method' => $method,
  ]);
  $evaluationSet->save();

  // Re-evaluate the sample.
  $newIds = $batchProcessor->reEvaluateBatch(
    $evaluationIds,
    'test_' . $method
  );

  // Record the average score for this method.
  $results[$method] = getAverageScoreForSet('test_' . $method);
}

// Display the results, best first.
arsort($results);
foreach ($results as $method => $avg) {
  \Drupal::messenger()->addMessage(
    t('@method: @avg', [
      '@method' => ucfirst($method),
      '@avg' => number_format($avg, 2),
    ])
  );
}

/**
 * Gets the average score for an evaluation set.
 */
function getAverageScoreForSet(string $setId): float {
  $query = \Drupal::database()->select('ai_autoevals_evaluation_result', 'e');
  $query->addExpression('AVG(e.score)', 'avg');
  $query->condition('e.evaluation_set_id', $setId);
  $query->condition('e.status', 'completed');

  return (float) $query->execute()->fetchField();
}
```

## Best Practices
1. **Use Representative Samples**

   Ensure your sample represents the full range of your content:

   ```php
   // Get a diverse sample.
   $evaluationIds = \Drupal::entityQuery('ai_autoevals_evaluation_result')
     ->accessCheck(FALSE)
     ->condition('status', 'completed')
     ->sort('created', 'DESC') // Get recent content, or use random sampling.
     ->range(0, 100)
     ->execute();
   ```

2. **Use a Sufficient Sample Size**

   Use a large enough sample for statistical significance:

   ```php
   // Aim for a minimum of 100 evaluations per configuration.
   $minimumSampleSize = 100;
   if (count($evaluationIds) < $minimumSampleSize) {
     \Drupal::messenger()->addWarning(t('Sample size is less than the recommended minimum of @count.', [
       '@count' => $minimumSampleSize,
     ]));
   }
   ```

3. **Test Multiple Metrics**

   Compare multiple metrics, not just the average score:

   ```php
   $metrics = [
     'average_score' => $this->getAverageScore($setId),
     'completion_rate' => $this->getCompletionRate($setId),
     'score_distribution' => $this->getScoreDistribution($setId),
   ];
   ```

4. **Document Results**

   Keep records of your tests for future reference:

   ```php
   $testLog = \Drupal::logger('ai_autoevals_ab_testing');
   $testLog->info('A/B Test Results', [
     'lenient_avg' => $lenientAvg,
     'strict_avg' => $strictAvg,
     'sample_size' => count($evaluationIds),
     'date' => date('Y-m-d H:i:s'),
   ]);
   ```

5. **Iterate and Refine**

   A/B testing is an iterative process:

   1. Test initial configurations
   2. Analyze the results
   3. Refine configurations based on the findings
   4. Test again
   5. Repeat until you find the optimal configuration
## Next Steps

- **Content Moderation**: Content moderation workflow
- **Custom Fact Extractors**: Domain-specific evaluation
- **Event System**: Event system guide