Secure Private AI for Enterprises and Developers - amazee.ai

Architecture

This guide explains the architecture and key components of the AI AutoEvals module.

AI AutoEvals is built around a two-step LLM evaluation process:

  1. Fact Extraction: Analyze the user’s question to determine what a correct answer should contain
  2. Response Evaluation: Use a second LLM call to compare the AI response against extracted criteria

The key insight is that evaluation criteria are derived solely from the user’s question and context, not from the AI response itself. This avoids evaluation bias and ensures objective factuality checking.

Evaluation Flow:

AI Request
[Event Subscriber] PreGenerateResponseEvent
Check: ai_autoevals:internal tag?
↓ Yes → Skip (internal AI request)
↓ No
Check: Operation type configured?
↓ No → Skip
↓ Yes
Check: Auto-track OR ai_autoevals:track tag?
↓ No → Skip
↓ Yes
Find matching Evaluation Set
1. Check: Global query exclusion keywords? (Circuit Breaker)
↓ Match → Abort ALL evaluations (highest priority)
↓ No match
2. Identify Candidates:
- Get all enabled sets sorted by weight (lowest first)
- Filter by operation type AND tags
- Empty tags = match all requests
3. [Hook] Invoke hook_ai_autoevals_evaluation_sets_alter()
- Modules can remove sets from candidates
- Context: operation_type, tags, input_text, output_text
4. Check: Any evaluation sets remain?
↓ No → Skip
↓ Yes
5. Iterate candidates in weight order (Fall-through logic):
For each candidate set:
a. Check: Per-set query exclusion keywords?
↓ Match → Skip THIS set, try next
↓ No match
b. Check: Query trigger keywords?
↓ No match → Skip THIS set, try next
↓ Yes → SELECT THIS SET (winner)
c. If no keywords defined → SELECT THIS SET (winner)
6. If all candidates exhausted → No evaluation
[Conversation Tracker] Track request context
Store pending evaluation
AI Response Generated
[Event Subscriber] PostGenerateResponseEvent
Check: Global response exclusion keywords?
↓ Match → Skip ALL evaluations (highest priority)
↓ No match
Check: Per-set response exclusion keywords?
↓ Match → Skip this set (second priority)
↓ No match
Check: Response trigger keywords match? (if defined)
↓ No → Skip
↓ Yes
[Evaluation Manager] Create evaluation entity
[Queue Worker] Process async
[Fact Extractor] Extract evaluation criteria (AI request tagged with ai_autoevals:internal)
[Evaluator] Evaluate response against criteria (AI request tagged with ai_autoevals:internal)
[Event Dispatcher] Dispatch PostEvaluationEvent
[Database] Store result

The evaluation process requires making additional AI requests:

  • Fact Extraction: Extract evaluation criteria from user input
  • Response Evaluation: Evaluate AI response against criteria

These internal AI requests would normally trigger the AutoEvals system again, creating an infinite loop of evaluations evaluating evaluations.

To prevent this, the module uses the ai_autoevals:internal tag:

When Adding Internal Requests:

  • FactExtractor and Evaluator services add ['ai_autoevals:internal'] to all AI requests
  • Example: ->chat($input, $modelId, ['ai_autoevals:internal'])

When Checking Requests:

  • The event subscriber checks for this tag first in onPreGenerateResponse()
  • If present, the request is immediately skipped, preventing recursive evaluation

Why This Matters:

  • Prevents infinite loops and resource exhaustion
  • Separates user requests from internal evaluation requests
  • Ensures evaluations don’t generate evaluations
  • Maintains system stability and performance
Tag                     Purpose                                                     Added By
ai_autoevals:internal   Marks internal AI requests (skipped from evaluation)        Module internals
ai_autoevals:track      Requests manual evaluation when auto-tracking is disabled   Your code

The ai_autoevals:internal tag is automatically added by the module and should not be added manually.
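As a sketch of how the ai_autoevals:track tag might be used, assuming the same chat() signature shown in the internal-tag example above, your own code can request evaluation of a single call even when auto-tracking is disabled:

```php
// Hypothetical usage: opt a specific chat call into evaluation while
// auto-tracking is off. Signature follows the chat() example above.
$response = $aiProvider->chat($input, $modelId, ['ai_autoevals:track']);
```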

Service ID: ai_autoevals.config

Centralized configuration service for accessing module settings and AI provider configuration.

Responsibilities:

  • Access default AI provider and model settings
  • Retrieve global configuration values
  • Provide fallback to system defaults
  • Check configuration status

Key Methods:

  • getProviderId(): Get configured AI provider
  • getModelId(): Get configured AI model
  • getProvider(): Get AI provider instance
  • isConfigured(): Check if provider is configured
  • getGlobalExcludeQueryKeywords(): Get global query exclusions
  • getGlobalExcludeResponseKeywords(): Get global response exclusions
  • getOperationTypes(): Get configured operation types
  • isAutoTrackEnabled(): Check auto-track status
  • isDebugMode(): Check debug mode
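A minimal usage sketch of the service (service ID and method names from the list above; the class namespace and return types are assumptions):

```php
/** @var object $config Configuration service (class name assumed). */
$config = \Drupal::service('ai_autoevals.config');

// Only proceed when an AI provider is configured for evaluations.
if ($config->isConfigured()) {
  $providerId = $config->getProviderId();
  $modelId = $config->getModelId();
  $autoTrack = $config->isAutoTrackEnabled();
}
```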

Service ID: ai_autoevals.keyword_matcher

Reusable service for keyword matching logic used throughout the module.

Responsibilities:

  • Match keywords in text with case-insensitive comparison
  • Support ‘any’ and ‘all’ match modes
  • Normalize and validate keywords

Key Methods:

  • matchesAny(string $text, array $keywords): Check if any keyword matches
  • matchesAll(string $text, array $keywords): Check if all keywords match
  • matches(string $text, array $keywords, string $mode): Generic matching method
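The matching semantics can be illustrated with a self-contained sketch (not the module's actual implementation; it assumes case-insensitive substring matching):

```php
// Illustrative only: 'any' succeeds if at least one keyword occurs in the
// text; 'all' requires every keyword. Comparison is case-insensitive.
function matches(string $text, array $keywords, string $mode = 'any'): bool {
  $text = mb_strtolower($text);
  foreach ($keywords as $keyword) {
    $found = str_contains($text, mb_strtolower(trim($keyword)));
    if ($mode === 'any' && $found) {
      return TRUE;
    }
    if ($mode === 'all' && !$found) {
      return FALSE;
    }
  }
  // 'any' found nothing; 'all' found everything (or the list was empty).
  return $mode === 'all';
}
```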

Service ID: ai_autoevals.evaluation_manager

The central service that coordinates the evaluation lifecycle.

Responsibilities:

  • Create and manage evaluation entities
  • Queue evaluations for processing
  • Track evaluation status
  • Retrieve evaluation history and statistics
  • Route evaluations to matching evaluation sets

Key Methods:

  • createEvaluation(): Create new evaluation
  • queueEvaluation(): Queue for processing
  • getMatchingEvaluationSet(): Find matching configuration (legacy, doesn’t check keywords)
  • getMatchingEvaluationSetWithHook(): Find matching configuration with hook support (recommended method)
  • getStatistics(): Get dashboard statistics

Service ID: ai_autoevals.fact_extractor

Extracts evaluation criteria from user input using pluggable strategies.

Responsibilities:

  • Analyze user question to identify key facts
  • Generate evaluation criteria
  • Use custom knowledge for domain-specific extraction
  • Cache extraction results

Plugin Types:

  • AI Generated: Uses LLM to extract facts
  • Rule-Based: Uses patterns and rules
  • Hybrid: Combines AI and rule-based methods
  • Custom: Custom fact extractor plugins

Key Methods:

  • extractFacts(): Extract criteria from input
  • selectPlugin(): Select appropriate extraction plugin

Service ID: ai_autoevals.evaluator

Evaluates AI responses against extracted criteria.

Responsibilities:

  • Load evaluation prompt template
  • Construct evaluation prompt with facts and response
  • Call LLM for evaluation
  • Parse response to extract choice and analysis
  • Calculate score based on choice

Key Methods:

  • evaluate(): Perform evaluation
  • loadPromptTemplate(): Load custom prompt
  • parseResponse(): Parse LLM response
  • calculateScore(): Calculate score from choice
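How a choice maps to a score can be sketched as follows. The default mapping here is an assumption for illustration; actual scores come from each evaluation set's choice_scores configuration:

```php
// Illustrative sketch: map the LLM's choice (A–D) to a numeric score.
// Real scores are read from the evaluation set's choice_scores mapping.
function calculateScore(string $choice, array $choiceScores = []): float {
  $defaults = ['A' => 1.0, 'B' => 0.66, 'C' => 0.33, 'D' => 0.0];
  $scores = $choiceScores + $defaults;
  return $scores[strtoupper($choice)] ?? 0.0;
}
```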

Service ID: ai_autoevals.conversation_tracker

Maintains conversation context across multi-turn interactions.

Responsibilities:

  • Track conversation threads
  • Maintain parent-child relationships
  • Retrieve conversation context
  • Clear conversation data

Key Methods:

  • trackConversation(): Track a conversation turn
  • getConversationContext(): Retrieve context
  • isFollowUp(): Check if request is a follow-up
  • getThreadRoot(): Find thread root
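A usage sketch (method names as listed above; the parameter shapes are assumptions):

```php
$tracker = \Drupal::service('ai_autoevals.conversation_tracker');

// Pull prior turns into the evaluation context for follow-up questions.
if ($tracker->isFollowUp($requestId)) {
  $context = $tracker->getConversationContext($requestId);
}
```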

Service ID: ai_autoevals.batch_processor

Handles batch operations on evaluations.

Responsibilities:

  • Re-evaluate multiple evaluations
  • Compare evaluation configurations
  • Requeue failed evaluations
  • Schedule batch re-evaluations

Key Methods:

  • reEvaluateBatch(): Re-evaluate multiple items
  • compareConfigurations(): Compare sets
  • requeueAllFailed(): Requeue all failed

Class: Drupal\ai_autoevals\EventSubscriber\AiAutoevalsSubscriber

Listens to AI module events and triggers evaluations.

Responsibilities:

  • Listen to ai.request.post_generate_response events
  • Check if evaluation should be triggered using KeywordMatcher
  • Use AiAutoevalsConfig for configuration access
  • Create evaluation entities
  • Queue evaluations

Key Features:

  • Uses ai_autoevals.config for centralized configuration access
  • Uses ai_autoevals.keyword_matcher for all keyword matching logic
  • Implements exclusion keyword priority (global > per-set > trigger)

Entity Type ID: ai_autoevals_evaluation_result

Content entity storing evaluation results.

Key Fields:

  • evaluation_set_id: Reference to evaluation set configuration
  • request_id: Unique identifier for the AI request
  • request_parent_id: Parent request ID for conversation tracking
  • provider_id: AI provider used
  • model_id: AI model used
  • operation_type: Type of operation (chat, chat_completion)
  • input: User’s input/question
  • output: AI’s response
  • facts: Extracted evaluation criteria (JSON)
  • status: Evaluation status (pending, processing, completed, failed)
  • score: Final score (0.0 - 1.0)
  • choice: Evaluation choice (A, B, C, D)
  • analysis: LLM’s analysis
  • tags: Associated tags (JSON)
  • metadata: Additional metadata (JSON)

Entity Type ID: ai_autoevals_evaluation_set

Config entity storing evaluation configurations.

Key Fields:

  • label: Configuration name
  • description: Configuration description
  • operation_types: Operations to evaluate
  • fact_extraction_method: Method for extracting facts
  • custom_knowledge: Domain-specific knowledge
  • prompt_template_id: Custom prompt template
  • custom_prompt_template: Custom prompt override
  • choice_scores: Scoring for each choice (JSON)
  • tags: Tag filters (JSON)
  • query_keywords: Keywords to match in user queries (array of strings)
  • response_keywords: Keywords to match in AI responses (array of strings)
  • keyword_match_mode: How keywords match (‘any’ or ‘all’)
  • context_depth: Conversation context depth
  • status: Enable/disable
  • weight: Priority weight

Keyword Triggering:

Evaluation sets support keyword-based triggering in addition to tag-based routing:

  • query_keywords: Keywords checked against user input (pre-response)
  • response_keywords: Keywords checked against AI output (post-response)
  • exclude_query_keywords: Keywords that skip evaluation in user input
  • exclude_response_keywords: Keywords that skip evaluation in AI output
  • keyword_match_mode: ‘any’ (at least one matches) or ‘all’ (all must match)
  • hasKeywords(): Returns TRUE if query or response keywords are defined
  • matchesQuery(): Checks if query text matches keywords
  • matchesResponse(): Checks if response text matches keywords

Evaluation Set Selection Flow: The system uses a fall-through mechanism to select the best matching evaluation set:

  1. Global Query Exclusion (Circuit Breaker): If matched, aborts ALL evaluations immediately
  2. Candidate Identification: Filters enabled sets by operation type and tags
    • Sets returned sorted by weight (lowest first)
    • Empty tags match all requests
  3. Hook Filtering: Modules can remove sets from candidate list via hook_ai_autoevals_evaluation_sets_alter()
  4. Per-Set Keyword Matching (Fall-through): Iterates through candidates in weight order:
    • If set has exclusion keywords: Check query, skip if matched
    • If set has trigger keywords: Check query, skip if NOT matched
    • If set has no trigger keywords: Match automatically
    • First set that passes is selected
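The fall-through selection above can be sketched as a simple loop. This is a simplified illustration: hasKeywords() and matchesQuery() are the entity methods documented earlier, while getExcludeQueryKeywords() and the injected matcher are assumptions.

```php
// Simplified sketch of step 4: iterate candidates in weight order and
// return the first set that passes its keyword checks.
function selectEvaluationSet(array $candidates, string $query, $matcher) {
  foreach ($candidates as $set) {
    // a. Per-set exclusion keywords: skip this set, try the next one.
    if ($matcher->matchesAny($query, $set->getExcludeQueryKeywords())) {
      continue;
    }
    // b./c. Select on trigger-keyword match, or automatically if no
    // trigger keywords are defined (catch-all behavior).
    if (!$set->hasKeywords() || $set->matchesQuery($query)) {
      return $set;
    }
  }
  // All candidates exhausted: no evaluation.
  return NULL;
}
```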

Example Fall-Through Behavior:

  • Set A (weight 0): Has query keywords “weather”
  • Set B (weight 10): No query keywords (catch-all)

Query: “What is the time?”

  • Set A checked first → Fails keyword check → Skip
  • Set B checked next → Passes (no keywords) → Selected

Query: “What is the weather like?”

  • Set A checked first → Matches keywords → Selected

Keyword Priority:

  1. Global exclusion keywords (circuit breaker - applies to all sets)
  2. Per-set exclusion keywords
  3. Trigger keywords (if defined, empty = match all)


The EvaluationSetBuilder class provides a fluent API for programmatically creating evaluation sets:

use Drupal\ai_autoevals\Entity\EvaluationSet;

$set = EvaluationSet::builder('weather_eval', 'Weather Evaluation')
  ->withDescription('Evaluates weather-related AI responses')
  ->forOperations(['chat'])
  ->triggerOnKeywords(['weather', 'forecast', 'temperature'], [])
  ->excludeOnKeywords(['test', 'debug', 'mock'], [])
  ->withFactExtractionMethod('ai_generated')
  ->withContextDepth(3)
  ->build();

Available Methods:

  • withDescription(string $description): Set description
  • forOperations(array $types): Set operation types
  • withTags(array $tags): Set required tags
  • triggerOnKeywords(array $queryKeywords, array $responseKeywords = []): Set trigger keywords
  • excludeOnKeywords(array $queryKeywords, array $responseKeywords = []): Set exclusion keywords
  • withKeywordMatchMode(string $mode): Set ‘any’ or ‘all’ mode
  • withFactExtractionMethod(string $method): Set extraction method
  • withContextDepth(int $depth): Set context depth
  • withCustomKnowledge(string $knowledge): Set domain knowledge
  • withCustomPromptTemplate(string $template): Set custom prompt
  • withChoiceScores(array $scores): Set scoring mapping
  • withWeight(int $weight): Set priority weight
  • enabled(bool $enabled = TRUE): Set enabled status
  • disabled(): Disable the set
  • build(): Create and save the set
  • buildWithoutSaving(): Create without saving

Plugin Manager: plugin.manager.ai_autoevals.fact_extractor

Base class: FactExtractorPluginBase

Interface: FactExtractorPluginInterface

Built-in Plugins:

  • ai_generated: AI-powered fact extraction
  • keyword: Keyword-based extraction
  • regex: Regex pattern-based extraction
  • hybrid: Combines multiple methods

Creating Custom Plugins:

<?php

namespace Drupal\my_module\Plugin\FactExtractor;

use Drupal\ai_autoevals\Plugin\FactExtractor\FactExtractorPluginBase;

/**
 * @FactExtractor(
 *   id = "my_custom",
 *   label = @Translation("My Custom Extractor"),
 *   description = @Translation("Custom fact extraction logic.")
 * )
 */
class MyCustomExtractor extends FactExtractorPluginBase {

  public function extract(string $input, array $context = []): array {
    // Custom extraction logic.
    return [];
  }

}

Hook Name: hook_ai_autoevals_evaluation_sets_alter()

  • When: During evaluation matching (pre-response), after operation type and tags filtering, before keyword matching
  • Use Cases: Conditional evaluation based on language, user roles, content type, or custom business rules
  • Access: Can remove evaluation sets from array to prevent them from being used
  • Location: Invoked in EvaluationManager::getMatchingEvaluationSetWithHook()

Example Use Cases:

  • Only evaluate English-language content
  • Restrict evaluations to specific user roles
  • Skip evaluation for sensitive content
  • Implement complex routing logic based on multiple factors
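A hedged sketch of a hook implementation; the exact parameter shape is an assumption based on the context keys listed above:

```php
/**
 * Implements hook_ai_autoevals_evaluation_sets_alter().
 *
 * Removes all candidate sets for non-ASCII input (illustrative rule only).
 */
function my_module_ai_autoevals_evaluation_sets_alter(array &$sets, array $context) {
  // $context includes operation_type, tags, input_text, output_text.
  $input = $context['input_text'] ?? '';
  // Naive "English only" heuristic, purely for illustration.
  if (preg_match('/[^\x00-\x7F]/', $input)) {
    $sets = [];
  }
}
```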

See Extending > Hooks for complete documentation.

1. PreEvaluationEvent

  • Name: ai_autoevals.pre_evaluation
  • When: Before evaluation is sent to LLM
  • Use Cases: Modify facts, skip evaluation, add metadata

2. PostEvaluationEvent

  • Name: ai_autoevals.post_evaluation
  • When: After evaluation completes successfully
  • Use Cases: Content moderation, notifications, analytics

3. EvaluationFailedEvent

  • Name: ai_autoevals.evaluation_failed
  • When: When evaluation fails
  • Use Cases: Retry logic, alerting, error tracking

See Event System documentation for details.
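A minimal subscriber sketch for reacting to completed evaluations. The event name comes from the list above, but the event's accessor methods (getEvaluation(), getScore()) are assumptions; check the Event System documentation for the real API.

```php
namespace Drupal\my_module\EventSubscriber;

use Symfony\Component\EventDispatcher\EventSubscriberInterface;

class LowScoreAlertSubscriber implements EventSubscriberInterface {

  public static function getSubscribedEvents(): array {
    // Event name as documented above.
    return ['ai_autoevals.post_evaluation' => 'onPostEvaluation'];
  }

  /**
   * Logs a warning when an evaluation scores poorly (accessors assumed).
   */
  public function onPostEvaluation($event): void {
    if ($event->getEvaluation()->getScore() < 0.5) {
      \Drupal::logger('my_module')->warning('Low evaluation score detected.');
    }
  }

}
```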

Queue ID: ai_autoevals_evaluation_worker

Evaluations are processed asynchronously via Drupal Queue API.

Worker Class: Drupal\ai_autoevals\Plugin\QueueWorker\EvaluationQueueWorker

Processing Flow:

  1. Load evaluation entity
  2. Find matching evaluation set
  3. Dispatch PreEvaluationEvent
  4. Extract facts using FactExtractor
  5. Evaluate response using Evaluator
  6. Update evaluation entity with results
  7. Dispatch PostEvaluationEvent
  8. Catch errors and dispatch EvaluationFailedEvent

Time Limit: 60 seconds per cron run

Cache Bin: cache.ai_autoevals_facts

Fact extraction results are cached to improve performance and reduce API calls.

Cache Key: Based on input hash and evaluation set ID

Cache Tags: ai_autoevals:facts:{evaluation_set_id}

Clear cache when:

  • Evaluation set is modified
  • Custom knowledge is updated
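Programmatic invalidation can be sketched with core's cache tag API, using the tag format shown above ($evaluation_set_id is a placeholder):

```php
use Drupal\Core\Cache\Cache;

// Invalidate cached fact-extraction results for one evaluation set,
// e.g. after updating its custom knowledge.
Cache::invalidateTags(['ai_autoevals:facts:' . $evaluation_set_id]);
```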
Requirements:

  • Drupal 10.2+ / Drupal 11
  • AI module: Provides AI provider abstraction
  • Key module: Manages API keys securely
  • AI Provider: OpenAI, Anthropic, or compatible provider
  • LLM for Evaluation: Configurable, typically same as AI provider

Security Considerations:

  1. API Keys: Stored securely using Key module
  2. User Input: All input is sanitized and validated
  3. Rate Limiting: Respect provider rate limits
  4. Data Retention: Configurable retention period
  5. Access Control: Role-based permissions for all operations

Performance Considerations:

  1. Async Processing: Evaluations processed via queue to avoid blocking
  2. Caching: Fact extraction results cached
  3. Batch Operations: Efficient batch processing for re-evaluations
  4. Database Indexing: Indexed fields for efficient queries
  5. Queue Prioritization: Process evaluations in FIFO order

The module is designed to be extensible:

  • Custom Fact Extractors: Create plugins for specialized extraction
  • Custom Events: React to evaluation lifecycle events
  • Custom Prompts: Override evaluation prompts per configuration
  • Custom Scoring: Customize scoring per evaluation set
  • Custom Integrations: Integrate with moderation systems, observability, etc.