Architecture
This guide explains the architecture and key components of the AI AutoEvals module.
Overview
AI AutoEvals is built around a two-step LLM evaluation process:
- Fact Extraction: Analyze the user’s question to determine what a correct answer should contain
- Response Evaluation: Use a second LLM call to compare the AI response against extracted criteria
The key insight is that evaluation criteria are derived solely from the user’s question and context, not from the AI response itself. This avoids evaluation bias and ensures objective factuality checking.
System Flow
```
AI Request
  ↓
[Event Subscriber] PreGenerateResponseEvent
  ↓
Check: ai_autoevals:internal tag?
  ↓ Yes → Skip (internal AI request)
  ↓ No
Check: Operation type configured?
  ↓ No → Skip
  ↓ Yes
Check: Auto-track OR ai_autoevals:track tag?
  ↓ No → Skip
  ↓ Yes
Find matching Evaluation Set
  ↓
1. Check: Global query exclusion keywords? (Circuit Breaker)
   ↓ Match → Abort ALL evaluations (highest priority)
   ↓ No match
2. Identify Candidates:
   - Get all enabled sets sorted by weight (lowest first)
   - Filter by operation type AND tags
   - Empty tags = match all requests
   ↓
3. [Hook] Invoke hook_ai_autoevals_evaluation_sets_alter()
   - Modules can remove sets from candidates
   - Context: operation_type, tags, input_text, output_text
   ↓
4. Check: Any evaluation sets remain?
   ↓ No → Skip
   ↓ Yes
5. Iterate candidates in weight order (fall-through logic):
   For each candidate set:
     a. Check: Per-set query exclusion keywords?
        ↓ Match → Skip THIS set, try next
        ↓ No match
     b. Check: Query trigger keywords?
        ↓ No match → Skip THIS set, try next
        ↓ Yes → SELECT THIS SET (winner)
     c. If no keywords defined → SELECT THIS SET (winner)
   ↓
6. If all candidates exhausted → No evaluation
  ↓
[Conversation Tracker] Track request context
  ↓
Store pending evaluation
  ↓
AI Response Generated
  ↓
[Event Subscriber] PostGenerateResponseEvent
  ↓
Check: Global response exclusion keywords?
  ↓ Match → Skip ALL evaluations (highest priority)
  ↓ No match
Check: Per-set response exclusion keywords?
  ↓ Match → Skip this set (second priority)
  ↓ No match
Check: Response trigger keywords match? (if defined)
  ↓ No → Skip
  ↓ Yes
[Evaluation Manager] Create evaluation entity
  ↓
[Queue Worker] Process async
  ↓
[Fact Extractor] Extract evaluation criteria
  (AI request tagged with ai_autoevals:internal)
  ↓
[Evaluator] Evaluate response against criteria
  (AI request tagged with ai_autoevals:internal)
  ↓
[Event Dispatcher] Dispatch PostEvaluationEvent
  ↓
[Database] Store result
```

Preventing Infinite Evaluation Loops
The ai_autoevals:internal Tag
The evaluation process requires making additional AI requests:
- Fact Extraction: Extract evaluation criteria from user input
- Response Evaluation: Evaluate AI response against criteria
These internal AI requests would normally trigger the AutoEvals system again, creating an infinite loop of evaluations evaluating evaluations.
To prevent this, the module uses the ai_autoevals:internal tag:
When Adding Internal Requests:
- FactExtractor and Evaluator services add `['ai_autoevals:internal']` to all AI requests
- Example: `->chat($input, $modelId, ['ai_autoevals:internal'])`
When Checking Requests:
- The event subscriber checks for this tag first in `onPreGenerateResponse()`
- If the tag is present, the request is skipped immediately, preventing recursive evaluation
Why This Matters:
- Prevents infinite loops and resource exhaustion
- Separates user requests from internal evaluation requests
- Ensures evaluations don’t generate evaluations
- Maintains system stability and performance
Available Tags
| Tag | Purpose | Added By |
|---|---|---|
| `ai_autoevals:internal` | Marks internal AI requests (skipped from evaluation) | Module internals |
| `ai_autoevals:track` | Requests manual evaluation when auto-tracking is disabled | Your code |
The ai_autoevals:internal tag is automatically added by the module and should not be added manually.
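By contrast, the `ai_autoevals:track` tag is yours to add. The following sketch shows how custom code calling an AI provider directly could opt a single request into evaluation while auto-tracking is disabled; the `ai.provider` service and `openai`/`gpt-4o` identifiers are illustrative assumptions, not requirements of this module:

```php
// Resolve an AI provider via the AI module's provider plugin manager
// (provider and model IDs here are examples only).
$provider = \Drupal::service('ai.provider')->createInstance('openai');

// The tags array opts this request into evaluation even when
// global auto-tracking is disabled.
$response = $provider->chat($input, 'gpt-4o', ['ai_autoevals:track']);
```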
Core Components
Configuration Services
AiAutoevalsConfig
Service ID: ai_autoevals.config
Centralized configuration service for accessing module settings and AI provider configuration.
Responsibilities:
- Access default AI provider and model settings
- Retrieve global configuration values
- Provide fallback to system defaults
- Check configuration status
Key Methods:
- `getProviderId()`: Get configured AI provider
- `getModelId()`: Get configured AI model
- `getProvider()`: Get AI provider instance
- `isConfigured()`: Check if provider is configured
- `getGlobalExcludeQueryKeywords()`: Get global query exclusions
- `getGlobalExcludeResponseKeywords()`: Get global response exclusions
- `getOperationTypes()`: Get configured operation types
- `isAutoTrackEnabled()`: Check auto-track status
- `isDebugMode()`: Check debug mode
KeywordMatcher
Service ID: ai_autoevals.keyword_matcher
Reusable service for keyword matching logic used throughout the module.
Responsibilities:
- Match keywords in text with case-insensitive comparison
- Support ‘any’ and ‘all’ match modes
- Normalize and validate keywords
Key Methods:
- `matchesAny(string $text, array $keywords)`: Check if any keyword matches
- `matchesAll(string $text, array $keywords)`: Check if all keywords match
- `matches(string $text, array $keywords, string $mode)`: Generic matching method
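A sketch of how these methods might be called from custom code (service ID from above; whether matching is substring-level or word-level depends on the implementation):

```php
// Obtain the shared keyword matcher service.
$matcher = \Drupal::service('ai_autoevals.keyword_matcher');

$text = 'What is the weather forecast for tomorrow?';

// TRUE if at least one keyword occurs in the text (case-insensitive).
$any = $matcher->matchesAny($text, ['weather', 'traffic']);

// TRUE only if every keyword occurs in the text.
$all = $matcher->matchesAll($text, ['weather', 'forecast']);

// Generic form: the mode string selects 'any' or 'all' behavior.
$match = $matcher->matches($text, ['weather'], 'any');
```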
Core Services
1. Evaluation Manager
Service ID: ai_autoevals.evaluation_manager
The central service that coordinates the evaluation lifecycle.
Responsibilities:
- Create and manage evaluation entities
- Queue evaluations for processing
- Track evaluation status
- Retrieve evaluation history and statistics
- Route evaluations to matching evaluation sets
Key Methods:
- `createEvaluation()`: Create a new evaluation
- `queueEvaluation()`: Queue for processing
- `getMatchingEvaluationSet()`: Find matching configuration (legacy; does not check keywords)
- `getMatchingEvaluationSetWithHook()`: Find matching configuration with hook support (recommended)
- `getStatistics()`: Get dashboard statistics
2. Fact Extractor
Service ID: ai_autoevals.fact_extractor
Extracts evaluation criteria from user input using pluggable strategies.
Responsibilities:
- Analyze user question to identify key facts
- Generate evaluation criteria
- Use custom knowledge for domain-specific extraction
- Cache extraction results
Plugin Types:
- AI Generated: Uses LLM to extract facts
- Rule-Based: Uses patterns and rules
- Hybrid: Combines AI and rule-based methods
- Custom: Custom fact extractor plugins
Key Methods:
- `extractFacts()`: Extract criteria from input
- `selectPlugin()`: Select appropriate extraction plugin
3. Evaluator
Service ID: ai_autoevals.evaluator
Evaluates AI responses against extracted criteria.
Responsibilities:
- Load evaluation prompt template
- Construct evaluation prompt with facts and response
- Call LLM for evaluation
- Parse response to extract choice and analysis
- Calculate score based on choice
Key Methods:
- `evaluate()`: Perform evaluation
- `loadPromptTemplate()`: Load custom prompt
- `parseResponse()`: Parse LLM response
- `calculateScore()`: Calculate score from choice
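Conceptually, `calculateScore()` maps the LLM's letter choice to a numeric score using the per-set `choice_scores` configuration. A minimal sketch (the specific numbers here are hypothetical defaults, not values shipped by the module):

```php
// Hypothetical per-set mapping; real values come from the evaluation
// set's choice_scores configuration.
$choice_scores = ['A' => 1.0, 'B' => 0.75, 'C' => 0.25, 'D' => 0.0];

// Fall back to 0.0 when the LLM returns an unexpected choice.
$score = $choice_scores[$choice] ?? 0.0;
```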
4. Conversation Tracker
Service ID: ai_autoevals.conversation_tracker
Maintains conversation context across multi-turn interactions.
Responsibilities:
- Track conversation threads
- Maintain parent-child relationships
- Retrieve conversation context
- Clear conversation data
Key Methods:
- `trackConversation()`: Track a conversation turn
- `getConversationContext()`: Retrieve context
- `isFollowUp()`: Check if request is a follow-up
- `getThreadRoot()`: Find thread root
5. Batch Processor
Service ID: ai_autoevals.batch_processor
Handles batch operations on evaluations.
Responsibilities:
- Re-evaluate multiple evaluations
- Compare evaluation configurations
- Requeue failed evaluations
- Schedule batch re-evaluations
Key Methods:
- `reEvaluateBatch()`: Re-evaluate multiple items
- `compareConfigurations()`: Compare sets
- `requeueAllFailed()`: Requeue all failed evaluations
6. Event Subscriber
Class: Drupal\ai_autoevals\EventSubscriber\AiAutoevalsSubscriber
Listens to AI module events and triggers evaluations.
Responsibilities:
- Listen to `ai.request.post_generate_response` events
- Check if evaluation should be triggered using KeywordMatcher
- Use AiAutoevalsConfig for configuration access
- Create evaluation entities
- Queue evaluations
Key Features:
- Uses `ai_autoevals.config` for centralized configuration access
- Uses `ai_autoevals.keyword_matcher` for all keyword matching logic
- Implements exclusion keyword priority (global > per-set > trigger)
Data Model
EvaluationResult Entity
Entity Type ID: ai_autoevals_evaluation_result
Content entity storing evaluation results.
Key Fields:
- `evaluation_set_id`: Reference to evaluation set configuration
- `request_id`: Unique identifier for the AI request
- `request_parent_id`: Parent request ID for conversation tracking
- `provider_id`: AI provider used
- `model_id`: AI model used
- `operation_type`: Type of operation (chat, chat_completion)
- `input`: User's input/question
- `output`: AI's response
- `facts`: Extracted evaluation criteria (JSON)
- `status`: Evaluation status (pending, processing, completed, failed)
- `score`: Final score (0.0 - 1.0)
- `choice`: Evaluation choice (A, B, C, D)
- `analysis`: LLM's analysis
- `tags`: Associated tags (JSON)
- `metadata`: Additional metadata (JSON)
EvaluationSet Entity
Entity Type ID: ai_autoevals_evaluation_set
Config entity storing evaluation configurations.
Key Fields:
- `label`: Configuration name
- `description`: Configuration description
- `operation_types`: Operations to evaluate
- `fact_extraction_method`: Method for extracting facts
- `custom_knowledge`: Domain-specific knowledge
- `prompt_template_id`: Custom prompt template
- `custom_prompt_template`: Custom prompt override
- `choice_scores`: Scoring for each choice (JSON)
- `tags`: Tag filters (JSON)
- `query_keywords`: Keywords to match in user queries (array of strings)
- `response_keywords`: Keywords to match in AI responses (array of strings)
- `keyword_match_mode`: How keywords match ('any' or 'all')
- `context_depth`: Conversation context depth
- `status`: Enable/disable
- `weight`: Priority weight
Keyword Triggering:
Evaluation sets support keyword-based triggering in addition to tag-based routing:
- `query_keywords`: Keywords checked against user input (pre-response)
- `response_keywords`: Keywords checked against AI output (post-response)
- `exclude_query_keywords`: Keywords in user input that skip evaluation
- `exclude_response_keywords`: Keywords in AI output that skip evaluation
- `keyword_match_mode`: 'any' (at least one matches) or 'all' (all must match)
- `hasKeywords()`: Returns TRUE if query or response keywords are defined
- `matchesQuery()`: Checks if query text matches keywords
- `matchesResponse()`: Checks if response text matches keywords
Evaluation Set Selection Flow: The system uses a fall-through mechanism to select the best matching evaluation set:
- Global Query Exclusion (Circuit Breaker): If matched, aborts ALL evaluations immediately
- Candidate Identification: Filters enabled sets by operation type and tags
  - Sets are returned sorted by weight (lowest first)
  - Empty tags match all requests
- Hook Filtering: Modules can remove sets from the candidate list via `hook_ai_autoevals_evaluation_sets_alter()`
- Per-Set Keyword Matching (Fall-through): Iterates through candidates in weight order:
  - If the set has exclusion keywords: check the query, skip if matched
  - If the set has trigger keywords: check the query, skip if NOT matched
  - If the set has no trigger keywords: match automatically
  - The first set that passes is selected
Example Fall-Through Behavior:
- Set A (weight 0): Has query keywords “weather”
- Set B (weight 10): No query keywords (catch-all)
Query: “What is the time?”
- Set A checked first → Fails keyword check → Skip
- Set B checked next → Passes (no keywords) → Selected
Query: “What is the weather like?”
- Set A checked first → Matches keywords → Selected
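The fall-through behavior above can be sketched in plain PHP. The getter names here are hypothetical illustrations of the described logic, not the module's actual internal API:

```php
/**
 * Picks the first candidate set that survives exclusion and trigger checks.
 *
 * $candidates are assumed to be already filtered by operation type and
 * tags, and sorted by weight (lowest first).
 */
function select_evaluation_set(array $candidates, string $query, $matcher) {
  foreach ($candidates as $set) {
    // Per-set exclusion keywords: skip this set, try the next one.
    if ($matcher->matchesAny($query, $set->getExcludeQueryKeywords())) {
      continue;
    }
    $keywords = $set->getQueryKeywords();
    // No trigger keywords defined: the set matches automatically.
    if (empty($keywords)) {
      return $set;
    }
    // Trigger keywords defined: select only if they match.
    if ($matcher->matches($query, $keywords, $set->getKeywordMatchMode())) {
      return $set;
    }
  }
  // All candidates exhausted: no evaluation.
  return NULL;
}
```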
Keyword Priority:
- Global exclusion keywords (highest priority; circuit breaker that applies to all sets)
- Per-set exclusion keywords
- Trigger keywords (lowest priority; an empty keyword list means match all)
Builder Pattern
The EvaluationSetBuilder class provides a fluent API for programmatically creating evaluation sets:
```php
use Drupal\ai_autoevals\Entity\EvaluationSet;

$set = EvaluationSet::builder('weather_eval', 'Weather Evaluation')
  ->withDescription('Evaluates weather-related AI responses')
  ->forOperations(['chat'])
  ->triggerOnKeywords(['weather', 'forecast', 'temperature'], [])
  ->excludeOnKeywords(['test', 'debug', 'mock'], [])
  ->withFactExtractionMethod('ai_generated')
  ->withContextDepth(3)
  ->build();
```

Available Methods:
- `withDescription(string $description)`: Set description
- `forOperations(array $types)`: Set operation types
- `withTags(array $tags)`: Set required tags
- `triggerOnKeywords(array $queryKeywords, array $responseKeywords = [])`: Set trigger keywords
- `excludeOnKeywords(array $queryKeywords, array $responseKeywords = [])`: Set exclusion keywords
- `withKeywordMatchMode(string $mode)`: Set 'any' or 'all' mode
- `withFactExtractionMethod(string $method)`: Set extraction method
- `withContextDepth(int $depth)`: Set context depth
- `withCustomKnowledge(string $knowledge)`: Set domain knowledge
- `withCustomPromptTemplate(string $template)`: Set custom prompt
- `withChoiceScores(array $scores)`: Set scoring mapping
- `withWeight(int $weight)`: Set priority weight
- `enabled(bool $enabled = TRUE)`: Set enabled status
- `disabled()`: Disable the set
- `build()`: Create and save the set
- `buildWithoutSaving()`: Create without saving
Plugin System
Section titled “Plugin System”Fact Extractor Plugins
Plugin Manager: plugin.manager.ai_autoevals.fact_extractor
Base class: FactExtractorPluginBase
Interface: FactExtractorPluginInterface
Built-in Plugins:
- `ai_generated`: AI-powered fact extraction
- `keyword`: Keyword-based extraction
- `regex`: Regex pattern-based extraction
- `hybrid`: Combines multiple methods
Creating Custom Plugins:
```php
<?php

namespace Drupal\my_module\Plugin\FactExtractor;

use Drupal\ai_autoevals\Plugin\FactExtractor\FactExtractorPluginBase;

/**
 * @FactExtractor(
 *   id = "my_custom",
 *   label = @Translation("My Custom Extractor"),
 *   description = @Translation("Custom fact extraction logic.")
 * )
 */
class MyCustomExtractor extends FactExtractorPluginBase {

  public function extract(string $input, array $context = []): array {
    // Custom extraction logic.
    return [];
  }

}
```

Event System
Hook for Filtering Evaluation Sets
Hook Name: hook_ai_autoevals_evaluation_sets_alter()
- When: During evaluation matching (pre-response), after operation type and tags filtering, before keyword matching
- Use Cases: Conditional evaluation based on language, user roles, content type, or custom business rules
- Access: Can remove evaluation sets from array to prevent them from being used
- Location: Invoked in `EvaluationManager::getMatchingEvaluationSetWithHook()`
Example Use Cases:
- Only evaluate English-language content
- Restrict evaluations to specific user roles
- Skip evaluation for sensitive content
- Implement complex routing logic based on multiple factors
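A minimal sketch of the first use case, assuming the hook receives the candidate sets by reference along with the context array described above (the exact parameter shapes are assumptions based on this guide):

```php
/**
 * Implements hook_ai_autoevals_evaluation_sets_alter().
 */
function my_module_ai_autoevals_evaluation_sets_alter(array &$evaluation_sets, array $context): void {
  // Only evaluate English-language requests: drop all candidate sets
  // when the current interface language is not English.
  $langcode = \Drupal::languageManager()->getCurrentLanguage()->getId();
  if ($langcode !== 'en') {
    $evaluation_sets = [];
  }
}
```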
See Extending > Hooks for complete documentation.
Events Dispatched
1. PreEvaluationEvent
- Name: `ai_autoevals.pre_evaluation`
- When: Before the evaluation is sent to the LLM
- Use Cases: Modify facts, skip evaluation, add metadata
2. PostEvaluationEvent
- Name: `ai_autoevals.post_evaluation`
- When: After an evaluation completes successfully
- Use Cases: Content moderation, notifications, analytics
3. EvaluationFailedEvent
- Name: `ai_autoevals.evaluation_failed`
- When: An evaluation fails
- Use Cases: Retry logic, alerting, error tracking
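As an illustration, a subscriber reacting to completed evaluations might look like the sketch below. The event class path and its accessor are assumptions inferred from this guide; only the event name `ai_autoevals.post_evaluation` is documented above:

```php
namespace Drupal\my_module\EventSubscriber;

use Drupal\ai_autoevals\Event\PostEvaluationEvent;
use Symfony\Component\EventDispatcher\EventSubscriberInterface;

class LowScoreAlertSubscriber implements EventSubscriberInterface {

  public static function getSubscribedEvents(): array {
    return ['ai_autoevals.post_evaluation' => 'onPostEvaluation'];
  }

  public function onPostEvaluation(PostEvaluationEvent $event): void {
    // getEvaluation() is a hypothetical accessor returning the
    // evaluation result entity.
    $evaluation = $event->getEvaluation();
    // Flag low-scoring responses for manual review.
    if ((float) $evaluation->get('score')->value < 0.5) {
      \Drupal::logger('my_module')->warning('Low AutoEvals score for request @id.', [
        '@id' => $evaluation->get('request_id')->value,
      ]);
    }
  }

}
```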
See Event System documentation for details.
Queue Processing
Section titled “Queue Processing”Evaluation Queue
Queue ID: ai_autoevals_evaluation_worker
Evaluations are processed asynchronously via Drupal Queue API.
Worker Class: Drupal\ai_autoevals\Plugin\QueueWorker\EvaluationQueueWorker
Processing Flow:
- Load evaluation entity
- Find matching evaluation set
- Dispatch PreEvaluationEvent
- Extract facts using FactExtractor
- Evaluate response using Evaluator
- Update evaluation entity with results
- Dispatch PostEvaluationEvent
- Catch errors and dispatch EvaluationFailedEvent
Time Limit: 60 seconds per cron run
Caching
Fact Extraction Cache
Cache Bin: cache.ai_autoevals_facts
Fact extraction results are cached to improve performance and reduce API calls.
Cache Key: Based on input hash and evaluation set ID
Cache Tags: ai_autoevals:facts:{evaluation_set_id}
Clear cache when:
- Evaluation set is modified
- Custom knowledge is updated
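For example, custom code that updates a set's custom knowledge outside the admin UI can invalidate the corresponding cache tag directly, using the tag format shown above:

```php
use Drupal\Core\Cache\Cache;

// Invalidate cached fact-extraction results for one evaluation set.
Cache::invalidateTags(['ai_autoevals:facts:' . $evaluation_set->id()]);
```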
Dependencies
Required Modules
- Drupal 10.2+ / Drupal 11
- AI module: Provides AI provider abstraction
- Key module: Manages API keys securely
External Services
- AI Provider: OpenAI, Anthropic, or compatible provider
- LLM for Evaluation: Configurable, typically same as AI provider
Security Considerations
- API Keys: Stored securely using Key module
- User Input: All input is sanitized and validated
- Rate Limiting: Respect provider rate limits
- Data Retention: Configurable retention period
- Access Control: Role-based permissions for all operations
Performance Considerations
- Async Processing: Evaluations processed via queue to avoid blocking
- Caching: Fact extraction results cached
- Batch Operations: Efficient batch processing for re-evaluations
- Database Indexing: Indexed fields for efficient queries
- Queue Prioritization: Process evaluations in FIFO order
Extensibility
The module is designed to be extensible:
- Custom Fact Extractors: Create plugins for specialized extraction
- Custom Events: React to evaluation lifecycle events
- Custom Prompts: Override evaluation prompts per configuration
- Custom Scoring: Customize scoring per evaluation set
- Custom Integrations: Integrate with moderation systems, observability, etc.
Next Steps
- API Reference - Detailed service documentation
- Event System - Event system guide
- Plugin Development - Create custom plugins
- Extending the Module - Extension guide