Secure Private AI for Enterprises and Developers - amazee.ai

Architecture

This guide explains the architecture and key components of the AI AutoEvals module.

AI AutoEvals is built around a two-step LLM evaluation process:

  1. Fact Extraction: Analyze the user’s question to determine what a correct answer should contain
  2. Response Evaluation: Use a second LLM call to compare the AI response against extracted criteria

The key insight is that evaluation criteria are derived solely from the user’s question and context, not from the AI response itself. This avoids evaluation bias and ensures objective factuality checking.

Evaluation Flow:

AI Request
[Event Subscriber] PreGenerateResponseEvent
Check: ai_autoevals:internal tag?
↓ Yes → Skip (internal AI request)
↓ No
Check: Operation type configured?
↓ No → Skip
↓ Yes
Check: Auto-track OR ai_autoevals:track tag?
↓ No → Skip
↓ Yes
Find matching Evaluation Set
1. Check: Global query exclusion keywords? (Circuit Breaker)
↓ Match → Abort ALL evaluations (highest priority)
↓ No match
2. Identify Candidates:
- Get all enabled sets sorted by weight (lowest first)
- Filter by operation type AND tags
- Empty tags = match all requests
3. [Hook] Invoke hook_ai_autoevals_evaluation_sets_alter()
- Modules can remove sets from candidates
- Context: operation_type, tags, input_text, output_text
4. Check: Any evaluation sets remain?
↓ No → Skip
↓ Yes
5. Iterate candidates in weight order (Fall-through logic):
For each candidate set:
a. Check: Per-set query exclusion keywords?
↓ Match → Skip THIS set, try next
↓ No match
b. Check: Query trigger keywords?
↓ No match → Skip THIS set, try next
↓ Yes → SELECT THIS SET (winner)
c. If no keywords defined → SELECT THIS SET (winner)
6. If all candidates exhausted → No evaluation
[Conversation Tracker] Track request context
Store pending evaluation
AI Response Generated
[Event Subscriber] PostGenerateResponseEvent
Check: Global response exclusion keywords?
↓ Match → Skip ALL evaluations (highest priority)
↓ No match
Check: Per-set response exclusion keywords?
↓ Match → Skip this set (second priority)
↓ No match
Check: Response trigger keywords match? (if defined)
↓ No → Skip
↓ Yes
[Evaluation Manager] Create evaluation entity
[Queue Worker] Process async
[Fact Extractor] Extract evaluation criteria (AI request tagged with ai_autoevals:internal)
[Evaluator] Evaluate response against criteria (AI request tagged with ai_autoevals:internal)
[Event Dispatcher] Dispatch PostEvaluationEvent
[Database] Store result

The evaluation process requires making additional AI requests:

  • Fact Extraction: Extract evaluation criteria from user input
  • Response Evaluation: Evaluate AI response against criteria

These internal AI requests would normally trigger the AutoEvals system again, creating an infinite loop of evaluations evaluating evaluations.

To prevent this, the module uses the ai_autoevals:internal tag:

When Adding Internal Requests:

  • FactExtractor and Evaluator services add ['ai_autoevals:internal'] to all AI requests
  • Example: ->chat($input, $modelId, ['ai_autoevals:internal'])

When Checking Requests:

  • The event subscriber checks for this tag first in onPreGenerateResponse()
  • If present, the request is immediately skipped, preventing recursive evaluation

Why This Matters:

  • Prevents infinite loops and resource exhaustion
  • Separates user requests from internal evaluation requests
  • Ensures evaluations don’t generate evaluations
  • Maintains system stability and performance
Tag                     Purpose                                                     Added By
ai_autoevals:internal   Marks internal AI requests (skipped from evaluation)        Module internals
ai_autoevals:track      Requests manual evaluation when auto-tracking is disabled   Your code

The ai_autoevals:internal tag is automatically added by the module and should not be added manually.
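As a sketch of how the ai_autoevals:track tag might be used, assuming the same chat() signature shown in the internal-tag example above, your own code can request evaluation of a single call even when auto-tracking is disabled:

```php
// Hypothetical usage: opt a specific chat call into evaluation while
// auto-tracking is off. Signature follows the chat() example above.
$response = $aiProvider->chat($input, $modelId, ['ai_autoevals:track']);
```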

Service ID: ai_autoevals.config

Centralized configuration service for accessing module settings and AI provider configuration.

Responsibilities:

  • Access default AI provider and model settings
  • Retrieve global configuration values
  • Provide fallback to system defaults
  • Check configuration status

Key Methods:

  • getProviderId(): Get configured AI provider
  • getModelId(): Get configured AI model
  • getProvider(): Get AI provider instance
  • isConfigured(): Check if provider is configured
  • getGlobalExcludeQueryKeywords(): Get global query exclusions
  • getGlobalExcludeResponseKeywords(): Get global response exclusions
  • getOperationTypes(): Get configured operation types
  • isAutoTrackEnabled(): Check auto-track status
  • isDebugMode(): Check debug mode
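A minimal usage sketch of the service (service ID and method names from the list above; the class namespace and return types are assumptions):

```php
/** @var object $config Configuration service (class name assumed). */
$config = \Drupal::service('ai_autoevals.config');

// Only proceed when an AI provider is configured for evaluations.
if ($config->isConfigured()) {
  $providerId = $config->getProviderId();
  $modelId = $config->getModelId();
  $autoTrack = $config->isAutoTrackEnabled();
}
```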

Service ID: ai_autoevals.keyword_matcher

Reusable service for keyword matching logic used throughout the module.

Responsibilities:

  • Match keywords in text with case-insensitive comparison
  • Support ‘any’ and ‘all’ match modes
  • Normalize and validate keywords

Key Methods:

  • matchesAny(string $text, array $keywords): Check if any keyword matches
  • matchesAll(string $text, array $keywords): Check if all keywords match
  • matches(string $text, array $keywords, string $mode): Generic matching method
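The matching semantics can be illustrated with a self-contained sketch (not the module's actual implementation; it assumes case-insensitive substring matching):

```php
// Illustrative only: 'any' succeeds if at least one keyword occurs in the
// text; 'all' requires every keyword. Comparison is case-insensitive.
function matches(string $text, array $keywords, string $mode = 'any'): bool {
  $text = mb_strtolower($text);
  foreach ($keywords as $keyword) {
    $found = str_contains($text, mb_strtolower(trim($keyword)));
    if ($mode === 'any' && $found) {
      return TRUE;
    }
    if ($mode === 'all' && !$found) {
      return FALSE;
    }
  }
  // 'any' found nothing; 'all' found everything (or the list was empty).
  return $mode === 'all';
}
```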

Service ID: ai_autoevals.evaluation_manager

The central service that coordinates the evaluation lifecycle.

Responsibilities:

  • Create and manage evaluation entities
  • Queue evaluations for processing
  • Track evaluation status
  • Retrieve evaluation history and statistics
  • Route evaluations to matching evaluation sets

Key Methods:

  • createEvaluation(): Create new evaluation
  • queueEvaluation(): Queue for processing
  • getMatchingEvaluationSet(): Find matching configuration (legacy, doesn’t check keywords)
  • getMatchingEvaluationSetWithHook(): Find matching configuration with hook support (recommended method)
  • getStatistics(): Get dashboard statistics

Service ID: ai_autoevals.fact_extractor

Extracts evaluation criteria from user input using pluggable strategies.

Responsibilities:

  • Analyze user question to identify key facts
  • Generate evaluation criteria
  • Use custom knowledge for domain-specific extraction
  • Cache extraction results

Plugin Types:

  • AI Generated: Uses LLM to extract facts
  • Rule-Based: Uses patterns and rules
  • Hybrid: Combines AI and rule-based methods
  • Custom: Custom fact extractor plugins

Key Methods:

  • extractFacts(): Extract criteria from input
  • selectPlugin(): Select appropriate extraction plugin

Service ID: ai_autoevals.evaluator

Evaluates AI responses against extracted criteria.

Responsibilities:

  • Load evaluation prompt template
  • Construct evaluation prompt with facts and response
  • Call LLM for evaluation
  • Parse response to extract choice and analysis
  • Calculate score based on choice

Key Methods:

  • evaluate(): Perform evaluation
  • loadPromptTemplate(): Load custom prompt
  • parseResponse(): Parse LLM response
  • calculateScore(): Calculate score from choice
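How a choice maps to a score can be sketched as follows. The default mapping here is an assumption for illustration; actual scores come from each evaluation set's choice_scores configuration:

```php
// Illustrative sketch: map the LLM's choice (A–D) to a numeric score.
// Real scores are read from the evaluation set's choice_scores mapping.
function calculateScore(string $choice, array $choiceScores = []): float {
  $defaults = ['A' => 1.0, 'B' => 0.66, 'C' => 0.33, 'D' => 0.0];
  $scores = $choiceScores + $defaults;
  return $scores[strtoupper($choice)] ?? 0.0;
}
```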

Service ID: ai_autoevals.conversation_tracker

Maintains conversation context across multi-turn interactions.

Responsibilities:

  • Track conversation threads
  • Maintain parent-child relationships
  • Retrieve conversation context
  • Clear conversation data

Key Methods:

  • trackConversation(): Track a conversation turn
  • getConversationContext(): Retrieve context
  • isFollowUp(): Check if request is a follow-up
  • getThreadRoot(): Find thread root
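A usage sketch (method names as listed above; the parameter shapes are assumptions):

```php
$tracker = \Drupal::service('ai_autoevals.conversation_tracker');

// Pull prior turns into the evaluation context for follow-up questions.
if ($tracker->isFollowUp($requestId)) {
  $context = $tracker->getConversationContext($requestId);
}
```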

Service ID: ai_autoevals.batch_processor

Handles batch operations on evaluations.

Responsibilities:

  • Re-evaluate multiple evaluations
  • Compare evaluation configurations
  • Requeue failed evaluations
  • Schedule batch re-evaluations

Key Methods:

  • reEvaluateBatch(): Re-evaluate multiple items
  • compareConfigurations(): Compare sets
  • requeueAllFailed(): Requeue all failed

Class: Drupal\ai_autoevals\EventSubscriber\AiAutoevalsSubscriber

Listens to AI module events and triggers evaluations.

Responsibilities:

  • Listen to ai.request.post_generate_response events
  • Check if evaluation should be triggered using KeywordMatcher
  • Use AiAutoevalsConfig for configuration access
  • Create evaluation entities
  • Queue evaluations

Key Features:

  • Uses ai_autoevals.config for centralized configuration access
  • Uses ai_autoevals.keyword_matcher for all keyword matching logic
  • Implements exclusion keyword priority (global > per-set > trigger)

Entity Type ID: ai_autoevals_evaluation_result

Content entity storing evaluation results.

Key Fields:

  • evaluation_set_id: Reference to evaluation set configuration
  • request_id: Unique identifier for the AI request
  • request_parent_id: Parent request ID for conversation tracking
  • provider_id: AI provider used
  • model_id: AI model used
  • operation_type: Type of operation (chat, chat_completion)
  • input: User’s input/question
  • output: AI’s response
  • facts: Extracted evaluation criteria (JSON)
  • status: Evaluation status (pending, processing, completed, failed)
  • score: Final score (0.0 - 1.0)
  • choice: Evaluation choice (A, B, C, D)
  • analysis: LLM’s analysis
  • tags: Associated tags (JSON)
  • metadata: Additional metadata (JSON)

Entity Type ID: ai_autoevals_evaluation_set

Config entity storing evaluation configurations.

Key Fields:

  • label: Configuration name
  • description: Configuration description
  • operation_types: Operations to evaluate
  • fact_extraction_method: Method for extracting facts
  • custom_knowledge: Domain-specific knowledge
  • prompt_template_id: Custom prompt template
  • custom_prompt_template: Custom prompt override
  • choice_scores: Scoring for each choice (JSON)
  • tags: Tag filters (JSON)
  • query_keywords: Keywords to match in user queries (array of strings)
  • response_keywords: Keywords to match in AI responses (array of strings)
  • keyword_match_mode: How keywords match (‘any’ or ‘all’)
  • context_depth: Conversation context depth
  • status: Enable/disable
  • weight: Priority weight

Keyword Triggering:

Evaluation sets support keyword-based triggering in addition to tag-based routing:

  • query_keywords: Keywords checked against user input (pre-response)
  • response_keywords: Keywords checked against AI output (post-response)
  • exclude_query_keywords: Keywords that skip evaluation in user input
  • exclude_response_keywords: Keywords that skip evaluation in AI output
  • keyword_match_mode: ‘any’ (at least one matches) or ‘all’ (all must match)
  • hasKeywords(): Returns TRUE if query or response keywords are defined
  • matchesQuery(): Checks if query text matches keywords
  • matchesResponse(): Checks if response text matches keywords

Evaluation Set Selection Flow: The system uses a fall-through mechanism to select the best matching evaluation set:

  1. Global Query Exclusion (Circuit Breaker): If matched, aborts ALL evaluations immediately
  2. Candidate Identification: Filters enabled sets by operation type and tags
    • Sets returned sorted by weight (lowest first)
    • Empty tags match all requests
  3. Hook Filtering: Modules can remove sets from candidate list via hook_ai_autoevals_evaluation_sets_alter()
  4. Per-Set Keyword Matching (Fall-through): Iterates through candidates in weight order:
    • If set has exclusion keywords: Check query, skip if matched
    • If set has trigger keywords: Check query, skip if NOT matched
    • If set has no trigger keywords: Match automatically
    • First set that passes is selected
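The fall-through selection above can be sketched as a simple loop. This is a simplified illustration: hasKeywords() and matchesQuery() are the entity methods documented earlier, while getExcludeQueryKeywords() and the injected matcher are assumptions.

```php
// Simplified sketch of step 4: iterate candidates in weight order and
// return the first set that passes its keyword checks.
function selectEvaluationSet(array $candidates, string $query, $matcher) {
  foreach ($candidates as $set) {
    // a. Per-set exclusion keywords: skip this set, try the next one.
    if ($matcher->matchesAny($query, $set->getExcludeQueryKeywords())) {
      continue;
    }
    // b./c. Select on trigger-keyword match, or automatically if no
    // trigger keywords are defined (catch-all behavior).
    if (!$set->hasKeywords() || $set->matchesQuery($query)) {
      return $set;
    }
  }
  // All candidates exhausted: no evaluation.
  return NULL;
}
```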

Example Fall-Through Behavior:

  • Set A (weight 0): Has query keywords “weather”
  • Set B (weight 10): No query keywords (catch-all)

Query: “What is the time?”

  • Set A checked first → Fails keyword check → Skip
  • Set B checked next → Passes (no keywords) → Selected

Query: “What is the weather like?”

  • Set A checked first → Matches keywords → Selected

Keyword Priority:

  1. Global exclusion keywords (circuit breaker - applies to all sets)
  2. Per-set exclusion keywords
  3. Trigger keywords (if defined, empty = match all)


The EvaluationSetBuilder class provides a fluent API for programmatically creating evaluation sets:

use Drupal\ai_autoevals\Entity\EvaluationSet;

$set = EvaluationSet::builder('weather_eval', 'Weather Evaluation')
  ->withDescription('Evaluates weather-related AI responses')
  ->forOperations(['chat'])
  ->triggerOnKeywords(['weather', 'forecast', 'temperature'], [])
  ->excludeOnKeywords(['test', 'debug', 'mock'], [])
  ->withFactExtractionMethod('ai_generated')
  ->withContextDepth(3)
  ->build();

Available Methods:

  • withDescription(string $description): Set description
  • forOperations(array $types): Set operation types
  • withTags(array $tags): Set required tags
  • triggerOnKeywords(array $queryKeywords, array $responseKeywords = []): Set trigger keywords
  • excludeOnKeywords(array $queryKeywords, array $responseKeywords = []): Set exclusion keywords
  • withKeywordMatchMode(string $mode): Set ‘any’ or ‘all’ mode
  • withFactExtractionMethod(string $method): Set extraction method
  • withContextDepth(int $depth): Set context depth
  • withCustomKnowledge(string $knowledge): Set domain knowledge
  • withCustomPromptTemplate(string $template): Set custom prompt
  • withChoiceScores(array $scores): Set scoring mapping
  • withWeight(int $weight): Set priority weight
  • enabled(bool $enabled = TRUE): Set enabled status
  • disabled(): Disable the set
  • build(): Create and save the set
  • buildWithoutSaving(): Create without saving

Plugin Manager: plugin.manager.ai_autoevals.fact_extractor

Base class: FactExtractorPluginBase

Interface: FactExtractorPluginInterface

Built-in Plugins:

  • ai_generated: AI-powered fact extraction
  • keyword: Keyword-based extraction
  • regex: Regex pattern-based extraction
  • hybrid: Combines multiple methods

Creating Custom Plugins:

<?php

namespace Drupal\my_module\Plugin\FactExtractor;

use Drupal\ai_autoevals\Plugin\FactExtractor\FactExtractorPluginBase;

/**
 * @FactExtractor(
 *   id = "my_custom",
 *   label = @Translation("My Custom Extractor"),
 *   description = @Translation("Custom fact extraction logic.")
 * )
 */
class MyCustomExtractor extends FactExtractorPluginBase {

  public function extract(string $input, array $context = []): array {
    // Custom extraction logic.
    return [];
  }

}

Hook Name: hook_ai_autoevals_evaluation_sets_alter()

  • When: During evaluation matching (pre-response), after operation type and tags filtering, before keyword matching
  • Use Cases: Conditional evaluation based on language, user roles, content type, or custom business rules
  • Access: Can remove evaluation sets from array to prevent them from being used
  • Location: Invoked in EvaluationManager::getMatchingEvaluationSetWithHook()

Example Use Cases:

  • Only evaluate English-language content
  • Restrict evaluations to specific user roles
  • Skip evaluation for sensitive content
  • Implement complex routing logic based on multiple factors
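A hedged sketch of a hook implementation; the exact parameter shape is an assumption based on the context keys listed above:

```php
/**
 * Implements hook_ai_autoevals_evaluation_sets_alter().
 *
 * Removes all candidate sets for non-ASCII input (illustrative rule only).
 */
function my_module_ai_autoevals_evaluation_sets_alter(array &$sets, array $context) {
  // $context includes operation_type, tags, input_text, output_text.
  $input = $context['input_text'] ?? '';
  // Naive "English only" heuristic, purely for illustration.
  if (preg_match('/[^\x00-\x7F]/', $input)) {
    $sets = [];
  }
}
```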

See Extending > Hooks for complete documentation.

1. PreEvaluationEvent

  • Name: ai_autoevals.pre_evaluation
  • When: Before evaluation is sent to LLM
  • Use Cases: Modify facts, skip evaluation, add metadata

2. PostEvaluationEvent

  • Name: ai_autoevals.post_evaluation
  • When: After evaluation completes successfully
  • Use Cases: Content moderation, notifications, analytics

3. EvaluationFailedEvent

  • Name: ai_autoevals.evaluation_failed
  • When: When evaluation fails
  • Use Cases: Retry logic, alerting, error tracking

See Event System documentation for details.
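A minimal subscriber sketch for reacting to completed evaluations. The event name comes from the list above, but the event's accessor methods (getEvaluation(), getScore()) are assumptions; check the Event System documentation for the real API.

```php
namespace Drupal\my_module\EventSubscriber;

use Symfony\Component\EventDispatcher\EventSubscriberInterface;

class LowScoreAlertSubscriber implements EventSubscriberInterface {

  public static function getSubscribedEvents(): array {
    // Event name as documented above.
    return ['ai_autoevals.post_evaluation' => 'onPostEvaluation'];
  }

  /**
   * Logs a warning when an evaluation scores poorly (accessors assumed).
   */
  public function onPostEvaluation($event): void {
    if ($event->getEvaluation()->getScore() < 0.5) {
      \Drupal::logger('my_module')->warning('Low evaluation score detected.');
    }
  }

}
```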

Queue ID: ai_autoevals_evaluation_worker

Evaluations are processed asynchronously via Drupal Queue API.

Worker Class: Drupal\ai_autoevals\Plugin\QueueWorker\EvaluationQueueWorker

Processing Flow:

  1. Load evaluation entity
  2. Find matching evaluation set
  3. Dispatch PreEvaluationEvent
  4. Extract facts using FactExtractor
  5. Evaluate response using Evaluator
  6. Update evaluation entity with results
  7. Dispatch PostEvaluationEvent
  8. Catch errors and dispatch EvaluationFailedEvent

Time Limit: 60 seconds per cron run

Cache Bin: cache.ai_autoevals_facts

Fact extraction results are cached to improve performance and reduce API calls.

Cache Key: Based on input hash and evaluation set ID

Cache Tags: ai_autoevals:facts:{evaluation_set_id}

Clear cache when:

  • Evaluation set is modified
  • Custom knowledge is updated
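Programmatic invalidation can be sketched with core's cache tag API, using the tag format shown above ($evaluation_set_id is a placeholder):

```php
use Drupal\Core\Cache\Cache;

// Invalidate cached fact-extraction results for one evaluation set,
// e.g. after updating its custom knowledge.
Cache::invalidateTags(['ai_autoevals:facts:' . $evaluation_set_id]);
```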
Requirements:

  • Drupal 10.2+ / Drupal 11
  • AI module: Provides AI provider abstraction
  • Key module: Manages API keys securely
  • AI Provider: OpenAI, Anthropic, or compatible provider
  • LLM for Evaluation: Configurable, typically same as AI provider

Security Considerations:

  1. API Keys: Stored securely using Key module
  2. User Input: All input is sanitized and validated
  3. Rate Limiting: Respect provider rate limits
  4. Data Retention: Configurable retention period
  5. Access Control: Role-based permissions for all operations

Performance Considerations:

  1. Async Processing: Evaluations processed via queue to avoid blocking
  2. Caching: Fact extraction results cached
  3. Batch Operations: Efficient batch processing for re-evaluations
  4. Database Indexing: Indexed fields for efficient queries
  5. Queue Prioritization: Process evaluations in FIFO order

The module is designed to be extensible:

  • Custom Fact Extractors: Create plugins for specialized extraction
  • Custom Events: React to evaluation lifecycle events
  • Custom Prompts: Override evaluation prompts per configuration
  • Custom Scoring: Customize scoring per evaluation set
  • Custom Integrations: Integrate with moderation systems, observability, etc.