AI Agent Assessment Framework -- Complete Specification
Agent: api-architect
Domain: API Design, Assessment Architecture, Standards Alignment
Date: 2026-04-16
Table of Contents
- Executive Summary
- Standards Alignment Matrix
- Assessment Methodology
- Badge Design Specification
- Certification Process
- API Specification
- Database Schema
- Phase Breakdown
1. Executive Summary
The AI Agent Assessment Framework (AAAF) is a dual-axis scoring and certification system for AI agents. It produces a single badge displaying two independent scores:
- Performance Score: How well an agent executes (task completion, accuracy, speed, consistency, compliance).
- Capability Score: What an agent can do (domain breadth, complexity ceiling, tool proficiency, autonomy, learning rate, delegation, orchestration).
The system is designed for internal use within a multi-agent civilization (30+ agents) from day one, with architecture that supports external adoption by any organization running AI agents.
Design Principles
- Dual-axis independence: Performance and Capability are orthogonal. An agent can be a narrow expert with elite performance, or a versatile generalist with competent performance. Neither axis dominates.
- Evidence-based scoring: Every score must trace to observable, reproducible evidence. No subjective impressions without calibration.
- Standards-aligned: Maps to IEEE, ISO/IEC, and NIST frameworks from the start, not retrofitted later.
- Multi-tenant from day one: The data model, API, and scoring engine treat "organization" as a first-class concept, even when the only organization is the internal civilization.
2. Standards Alignment Matrix
Referenced Standards
| Standard | Full Title | Relevance |
|---|---|---|
| IEEE P2894 | Standard for AI Agent Interoperability | Agent capability description semantics; interoperability levels |
| ISO/IEC 22989:2022 | AI Concepts and Terminology | Canonical vocabulary for agent, system, model, transparency, explainability |
| ISO/IEC 42001:2023 | AI Management Systems | PDCA lifecycle, 38 controls, risk assessment, governance requirements |
| ISO/IEC 25059:2023 | Quality Model for AI Systems (SQuaRE) | Quality characteristics: accuracy, robustness, fairness, interpretability |
| NIST AI 100-1 | AI Risk Management Framework 1.0 | GOVERN/MAP/MEASURE/MANAGE functions; trustworthy AI characteristics |
| NIST AI Agent Standards Initiative (2026) | AI Agent Standards Initiative | Agent authentication, identity, security evaluation, interoperability profile |
| NIST AI 600-1 | Generative AI Profile | GenAI-specific risk categories and mitigations |
Dimension-to-Standard Mapping
| Assessment Dimension | Primary Standard | Specific Clause/Function |
|---|---|---|
| Performance: Task Completion Rate | NIST AI 100-1 | MEASURE 2.6 (valid and reliable) |
| Performance: Accuracy | ISO/IEC 25059 | Accuracy quality characteristic |
| Performance: Speed to Delivery | ISO/IEC 25059 | Time behavior sub-characteristic (performance efficiency) |
| Performance: Consistency | NIST AI 100-1 | MEASURE 2.6 (reliability across contexts) |
| Performance: Review Compliance | ISO/IEC 42001 | Control A.6.2.6 (monitoring), PDCA Act phase |
| Capability: Domain Breadth | IEEE P2894 | Capability description semantics |
| Capability: Complexity Ceiling | ISO/IEC 25059 | Functional suitability, completeness |
| Capability: Tool Proficiency | NIST Agent Initiative | Agent-tool interaction security/identity |
| Capability: Autonomy Level | ISO/IEC 22989 | Human-in/on/over-the-loop definitions |
| Capability: Learning Rate | ISO/IEC 25059 | Adaptability, continuous learning quality |
| Capability: Delegation | NIST Agent Initiative | Agent-to-agent communication, authorization |
| Capability: Orchestration | NIST Agent Initiative + IEEE P2894 | Multi-agent coordination, interoperability levels |
Compliance Checkpoints
The framework maps to ISO/IEC 42001 PDCA phases as follows:
| PDCA Phase | AAAF Activity |
|---|---|
| Plan | Define assessment criteria, select dimensions, weight configuration |
| Do | Execute assessment tasks, collect evidence, run scoring |
| Check | Validate scores against calibration baselines, compare to historical |
| Act | Issue badge, identify improvement areas, trigger re-assessment |
The framework maps to NIST AI RMF functions:
| NIST Function | AAAF Activity |
|---|---|
| GOVERN | Organization-level assessment policies, tier definitions, weight configs |
| MAP | Agent context mapping (domain, tools, autonomy level) |
| MEASURE | Evidence collection, scoring, grading |
| MANAGE | Badge issuance, certification lifecycle, re-assessment triggers |
3. Assessment Methodology
3.1 Performance Scoring
Performance measures how well an agent executes tasks it is given. All performance dimensions produce a normalized score from 0.0 to 1.0.
Dimension: Task Completion Rate (Weight: 25%)
Definition: Proportion of assigned tasks completed to acceptance criteria without human intervention.
Measurement Method:
- Track all tasks assigned to the agent over the assessment window.
- A task is "complete" when it meets its defined acceptance criteria (not merely when the agent reports it done).
- Partial completion receives fractional credit using a milestone-based formula.
Scoring Formula:
TCR = (fully_completed * 1.0 + partially_completed * milestone_fraction) / total_assigned
Evidence Required:
- Task log with assignment timestamp, completion timestamp, acceptance status.
- For partial completions: milestone checklist with individual milestone status.
Calibration: Baseline established from first 50 assessed tasks. Score is absolute, not curved.
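The formula above can be sketched in Python (function and argument names are illustrative, not part of the spec):

```python
def task_completion_rate(fully_completed, partial_milestones, total_assigned):
    """Sketch of the TCR formula. `partial_milestones` holds one
    (milestones_done, milestones_total) pair per partially completed task."""
    if total_assigned == 0:
        return 0.0
    # Each partial completion earns its milestone fraction as credit.
    partial_credit = sum(done / total for done, total in partial_milestones)
    return (fully_completed + partial_credit) / total_assigned

# 8 tasks fully completed, one task at 2 of 4 milestones, 10 assigned in the window
print(task_completion_rate(8, [(2, 4)], 10))  # 0.85
```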
Dimension: Accuracy (Weight: 25%)
Definition: Correctness of agent output, measured as inverse of revision rate and error density.
Measurement Method:
- Revision Rate: Proportion of completed tasks requiring revision after initial submission.
- Error Density: Number of substantive errors per 1000 output tokens (for text/code) or per task (for non-text outputs).
- Severity Weighting: Critical errors (functional breakage) weighted 3x. Minor errors (style, formatting) weighted 0.5x.
Scoring Formula:
accuracy_from_revisions = max(0, 1.0 - (revision_count / completed_task_count)) // floored at 0 when tasks need multiple revisions
accuracy_from_errors = 1.0 - min(1.0, weighted_error_count / baseline_threshold)
Accuracy = 0.6 * accuracy_from_revisions + 0.4 * accuracy_from_errors
Evidence Required:
- Revision log with reason codes (critical/major/minor).
- Error annotations on sampled outputs (minimum 20% sample rate).
Grading Method: Code-based graders for code output (test suites, linters). Model-based graders with calibrated rubrics for text output (per Anthropic's recommendation of 0.80+ Spearman correlation with human evaluators).
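A sketch of the two-part Accuracy formula with the 3x/0.5x severity weights; the `baseline_threshold` default here is an illustrative placeholder that an implementation would set during calibration:

```python
SEVERITY_WEIGHTS = {"critical": 3.0, "major": 1.0, "minor": 0.5}

def accuracy_score(revision_count, completed_tasks, error_severities,
                   baseline_threshold=10.0):
    """error_severities: severity labels for errors found in sampled output."""
    from_revisions = max(0.0, 1.0 - revision_count / completed_tasks)
    weighted = sum(SEVERITY_WEIGHTS[s] for s in error_severities)
    from_errors = 1.0 - min(1.0, weighted / baseline_threshold)
    return 0.6 * from_revisions + 0.4 * from_errors

# 5 revisions over 50 tasks; one critical and two minor errors in the sample
print(accuracy_score(5, 50, ["critical", "minor", "minor"]))  # ≈ 0.78
```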
Dimension: Speed to Delivery (Weight: 15%)
Definition: Time from task assignment to acceptable completion, relative to task complexity.
Measurement Method:
- Measure wall-clock time from assignment to acceptance.
- Normalize by estimated complexity (story points or t-shirt size).
- Score against complexity-specific baselines.
Scoring Formula:
normalized_time = actual_time / baseline_time_for_complexity
Speed = min(1.0, max(0, 1.0 - (normalized_time - 1.0))) // 1.0 at or below baseline, 0.0 at 2x baseline
Evidence Required:
- Timestamp pairs (assigned, completed) per task.
- Complexity classification per task.
- Baseline lookup table per complexity class.
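A sketch of the speed formula, with the result clamped to the 0.0-1.0 range so faster-than-baseline delivery caps at 1.0 (names are illustrative):

```python
def speed_score(actual_seconds, baseline_seconds):
    """1.0 at or under the complexity baseline, linear decay to 0.0 at 2x."""
    normalized = actual_seconds / baseline_seconds
    return max(0.0, min(1.0, 1.0 - (normalized - 1.0)))

print(speed_score(1800, 1800))  # 1.0  (on baseline)
print(speed_score(2700, 1800))  # 0.5  (1.5x baseline)
print(speed_score(4000, 1800))  # 0.0  (past 2x baseline)
```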
Dimension: Consistency (Weight: 20%)
Definition: Variance in performance across repeated similar tasks.
Measurement Method:
- Group tasks by category/complexity.
- Measure coefficient of variation (CV) of accuracy and completion rate within each group.
- Lower CV equals higher consistency.
Scoring Formula:
cv_accuracy = std(accuracy_per_task_group) / mean(accuracy_per_task_group)
cv_completion = std(completion_per_task_group) / mean(completion_per_task_group)
Consistency = 1.0 - min(1.0, (0.6 * cv_accuracy + 0.4 * cv_completion))
Alignment: Maps to NIST AI RMF "reliable" characteristic -- consistent behavior across varied contexts.
Evidence Required:
- Per-task scores grouped by category.
- Minimum 5 tasks per category for statistical validity.
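Using the standard library, the CV-based formula might look like this. How per-group CVs are aggregated across categories is left to the implementation, so this sketch scores a single task group:

```python
from statistics import mean, stdev

def cv(values):
    """Coefficient of variation: relative spread of per-task scores."""
    return stdev(values) / mean(values)

def consistency_score(accuracy_scores, completion_scores):
    """Per-task scores within one task group (>= 5 tasks per group)."""
    penalty = 0.6 * cv(accuracy_scores) + 0.4 * cv(completion_scores)
    return 1.0 - min(1.0, penalty)

# Perfectly stable agent: zero variance, consistency 1.0
print(consistency_score([0.8] * 5, [1.0] * 5))  # 1.0
```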
Dimension: Review Compliance (Weight: 15%)
Definition: Adherence to organizational review processes, output format standards, and verification protocols.
Measurement Method:
- Binary checklist per task: Did the agent follow the required process?
- Checklist items are configurable per organization. Internal default items:
- Memory search performed before work (yes/no)
- Output format matches template (yes/no)
- Verification evidence provided (yes/no)
- Memory written after significant work (yes/no)
Scoring Formula:
ReviewCompliance = checklist_items_passed / total_checklist_items // averaged across tasks
Alignment: Maps to ISO/IEC 42001 Control A.6.2.6 (monitoring and measurement of AI system performance).
Evidence Required:
- Checklist results per task.
- Spot-check audit of 10% of tasks for honest self-reporting.
Performance Composite Score
PerformanceScore = (0.25 * TCR) + (0.25 * Accuracy) + (0.15 * Speed) + (0.20 * Consistency) + (0.15 * ReviewCompliance)
Weights are configurable per organization. The above are defaults.
Performance Tier Mapping
| Tier | Score Range | Description |
|---|---|---|
| Novice | 0.00 -- 0.39 | Requires significant human oversight; frequent errors or incomplete tasks |
| Competent | 0.40 -- 0.59 | Handles routine tasks with occasional supervision needed |
| Proficient | 0.60 -- 0.74 | Reliable execution across standard tasks; minimal revision needed |
| Expert | 0.75 -- 0.89 | High accuracy, speed, and consistency; trusted for complex tasks |
| Elite | 0.90 -- 1.00 | Exceptional execution; sets the standard for the domain |
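Putting the default weights and tier bands together (a sketch; names are illustrative), the performance dimension values from the sample badge later in this spec compose to 0.824, which rounds to the 0.82 / Expert shown there:

```python
PERFORMANCE_WEIGHTS = {
    "task_completion_rate": 0.25, "accuracy": 0.25, "speed": 0.15,
    "consistency": 0.20, "review_compliance": 0.15,
}
# (lower bound, tier name), checked from the top down
PERFORMANCE_TIERS = [(0.90, "Elite"), (0.75, "Expert"), (0.60, "Proficient"),
                     (0.40, "Competent"), (0.00, "Novice")]

def composite(scores, weights):
    return sum(weights[dim] * s for dim, s in scores.items())

def tier(score, bands):
    return next(name for bound, name in bands if score >= bound)

scores = {"task_completion_rate": 0.91, "accuracy": 0.85, "speed": 0.72,
          "consistency": 0.78, "review_compliance": 0.80}
p = composite(scores, PERFORMANCE_WEIGHTS)
print(round(p, 2), tier(p, PERFORMANCE_TIERS))  # 0.82 Expert
```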
3.2 Capability Scoring
Capability measures what an agent can do, independent of how well it does it on any given task. All capability dimensions produce a normalized score from 0.0 to 1.0.
Dimension: Domain Breadth (Weight: 15%)
Definition: Number of distinct task domains the agent can operate in with at least Competent-level performance.
Measurement Method:
- Maintain a domain taxonomy (configurable per organization).
- Internal default: 12 domains (research, code, security, testing, design, documentation, communication, orchestration, analysis, content, infrastructure, legal).
- Agent must demonstrate Competent-level performance (>=0.40 on Performance Score) in a domain to claim it.
Scoring Formula:
DomainBreadth = qualified_domains / total_domains_in_taxonomy
Evidence Required:
- At least 3 assessed tasks per claimed domain.
- Performance score >= 0.40 in each claimed domain.
Dimension: Complexity Ceiling (Weight: 20%)
Definition: Maximum task complexity the agent can complete successfully.
Measurement Method:
- Tasks are classified on a 5-level complexity scale:
- L1: Single-step, well-defined (e.g., format conversion)
- L2: Multi-step, well-defined (e.g., code refactor with tests)
- L3: Multi-step, ambiguous requirements (e.g., design API from vague brief)
- L4: Cross-domain, requires judgment (e.g., security audit with architectural recommendations)
- L5: Novel, no prior template (e.g., design assessment framework from scratch)
- Complexity ceiling is the highest level at which the agent achieves >= 0.60 Performance Score.
Scoring Formula:
ComplexityCeiling = highest_passing_level / 5.0
Evidence Required:
- At least 3 assessed tasks at the claimed complexity level.
- Performance score >= 0.60 at that level.
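The highest-passing-level rule can be sketched as follows (names illustrative). Autonomy Level below follows the same pattern with levels 0-3 and a divisor of 3.0:

```python
def highest_passing_level(scores_by_level, threshold=0.60):
    """scores_by_level: {level: mean Performance Score at that level},
    each level backed by >= 3 assessed tasks."""
    passing = [lvl for lvl, s in scores_by_level.items() if s >= threshold]
    return max(passing, default=0)

def complexity_ceiling(scores_by_level):
    return highest_passing_level(scores_by_level) / 5.0

# Passes L1-L4 at >= 0.60 but not L5
print(complexity_ceiling({1: 0.92, 2: 0.85, 3: 0.74, 4: 0.66, 5: 0.41}))  # 0.8
```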
Dimension: Tool Proficiency (Weight: 15%)
Definition: Effectiveness in using available tools (code execution, web search, file operations, APIs, databases).
Measurement Method:
- Maintain a tool inventory for the agent.
- For each tool, measure: correct invocation rate, error recovery rate, and efficiency (1 - unnecessary_calls / total_calls).
Scoring Formula:
per_tool_score = 0.5 * correct_invocation_rate + 0.3 * error_recovery_rate + 0.2 * (1 - unnecessary_call_rate)
ToolProficiency = mean(per_tool_score for all_tools)
Alignment: Maps to NIST Agent Standards Initiative research on agent-tool interaction security.
Evidence Required:
- Tool call logs with success/failure status.
- Minimum 10 invocations per tool for statistical validity.
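The per-tool formula and its mean can be sketched directly (a sketch; the rate triples would be derived from the tool call logs):

```python
from statistics import mean

def per_tool_score(correct_rate, recovery_rate, unnecessary_rate):
    return (0.5 * correct_rate + 0.3 * recovery_rate
            + 0.2 * (1.0 - unnecessary_rate))

def tool_proficiency(tool_stats):
    """tool_stats: one (correct, recovery, unnecessary) rate triple per tool,
    each backed by >= 10 logged invocations."""
    return mean(per_tool_score(*rates) for rates in tool_stats)

# First tool flawless; second has misfires, poor recovery, some wasted calls
print(round(tool_proficiency([(1.0, 1.0, 0.0), (0.9, 0.5, 0.2)]), 3))  # 0.88
```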
Dimension: Autonomy Level (Weight: 10%)
Definition: Degree of independent operation the agent sustains without human intervention.
Measurement Method:
- Uses ISO/IEC 22989 terminology for human oversight levels:
- Level 0: Human-in-the-loop (every step approved)
- Level 1: Human-on-the-loop (periodic checkpoints)
- Level 2: Human-over-the-loop (exception-based intervention)
- Level 3: Fully autonomous (report-only)
- Measured as the highest autonomy level at which the agent maintains >= 0.60 Performance Score.
Scoring Formula:
AutonomyLevel = highest_passing_level / 3.0
Evidence Required:
- Assessment tasks run at each autonomy level.
- Performance scores at each level.
Dimension: Learning Rate (Weight: 10%)
Definition: Speed at which the agent improves on repeated exposure to similar task types.
Measurement Method:
- Compare Performance Score on first 5 tasks in a category vs. tasks 11-15 in the same category.
- Measures within-session improvement (if applicable) and cross-session improvement.
Scoring Formula:
improvement = performance_later - performance_earlier
LearningRate = min(1.0, max(0, improvement / 0.30)) // 0.30 improvement = perfect score
Note: A ceiling of 0.30 improvement normalizes the range. Agents already at Elite performance cannot improve much, so Learning Rate is weighted lower.
Evidence Required:
- Chronologically ordered task assessments within each category.
- Minimum 15 tasks per category for valid measurement.
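A sketch of the learning-rate comparison over the chronologically ordered task scores (names illustrative):

```python
from statistics import mean

def learning_rate(first_five, tasks_11_to_15, ceiling=0.30):
    """Mean Performance Score of tasks 11-15 minus that of the first 5,
    normalized so a 0.30 gain maps to a perfect score."""
    improvement = mean(tasks_11_to_15) - mean(first_five)
    return min(1.0, max(0.0, improvement / ceiling))

# Mean rises from 0.60 to 0.75: half of the 0.30 ceiling
print(learning_rate([0.55, 0.60, 0.58, 0.62, 0.65],
                    [0.72, 0.75, 0.74, 0.78, 0.76]))  # ≈ 0.5
```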
Dimension: Delegation Capability (Weight: 15%)
Definition: Effectiveness in identifying when to delegate, selecting appropriate delegates, and managing delegated work.
Measurement Method:
- Three sub-dimensions:
- Delegation Judgment (40%): Did the agent correctly identify tasks that should be delegated vs. self-executed? Measured against expert-labeled ground truth.
- Delegate Selection (30%): Did the agent choose the optimal agent/resource for the delegated task? Measured against a capability matrix.
- Delegation Management (30%): Did delegated tasks complete successfully? What was the revision rate on delegated work?
Scoring Formula:
DelegationCapability = 0.4 * judgment_score + 0.3 * selection_score + 0.3 * management_score
Assessment Tasks:
- Present agent with a mixed workload of 20 tasks: some within its domain, some outside, some at the boundary.
- Measure: correct routing decisions, quality of delegation prompts, outcome tracking.
Alignment: Maps to NIST Agent Standards Initiative (agent-to-agent communication, authorization) and IEEE P2894 (agent capability descriptions enabling informed delegation).
Evidence Required:
- Delegation decision log with expert annotations.
- Delegate selection justification vs. capability matrix.
- Delegated task outcomes.
Dimension: Orchestration Skills (Weight: 15%)
Definition: Effectiveness in coordinating multi-agent workflows, managing dependencies, and synthesizing results.
Measurement Method:
- Four sub-dimensions:
- Workflow Design (25%): Quality of the coordination plan (parallelism, dependency ordering, resource efficiency).
- Communication Clarity (25%): Quality of prompts/instructions sent to coordinated agents.
- Failure Handling (25%): Recovery from individual agent failures without cascading collapse.
- Synthesis Quality (25%): Quality of the final integrated output from multi-agent work.
Scoring Formula:
Orchestration = 0.25 * workflow_design + 0.25 * communication_clarity + 0.25 * failure_handling + 0.25 * synthesis_quality
Assessment Tasks:
- Assign multi-agent coordination tasks requiring 3+ agents.
- Inject controlled failures (one agent returns errors, one returns incomplete work).
- Measure coordination quality and outcome quality.
Alignment: Maps to IEEE P2894 (interoperability levels), NIST Agent Standards Initiative (multi-agent security), ISO/IEC 42001 (system-level governance).
Evidence Required:
- Workflow plan artifacts.
- Inter-agent communication logs.
- Failure injection results.
- Final synthesized output quality scores.
Capability Composite Score
CapabilityScore = (0.15 * DomainBreadth) + (0.20 * ComplexityCeiling) + (0.15 * ToolProficiency) + (0.10 * AutonomyLevel) + (0.10 * LearningRate) + (0.15 * DelegationCapability) + (0.15 * Orchestration)
Weights are configurable per organization.
Capability Tier Mapping
| Tier | Score Range | Description |
|---|---|---|
| Narrow | 0.00 -- 0.29 | Single domain, low complexity, requires close supervision |
| Functional | 0.30 -- 0.49 | Few domains, moderate complexity, basic tool use |
| Versatile | 0.50 -- 0.69 | Multiple domains, handles ambiguity, effective delegation |
| Specialist | 0.70 -- 0.84 | Deep expertise + breadth, orchestrates others, high autonomy |
| Full-Stack | 0.85 -- 1.00 | Operates across all domains, orchestrates complex multi-agent workflows, fully autonomous |
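The capability composite mirrors the performance composite (a sketch; names illustrative). The sample badge's capability dimension values compose to 0.7095, rounding to the 0.71 / Specialist shown in section 4.5:

```python
CAPABILITY_WEIGHTS = {
    "domain_breadth": 0.15, "complexity_ceiling": 0.20, "tool_proficiency": 0.15,
    "autonomy_level": 0.10, "learning_rate": 0.10,
    "delegation_capability": 0.15, "orchestration_skills": 0.15,
}
# (lower bound, tier name), checked from the top down
CAPABILITY_TIERS = [(0.85, "Full-Stack"), (0.70, "Specialist"),
                    (0.50, "Versatile"), (0.30, "Functional"), (0.00, "Narrow")]

def capability_composite(scores, weights=CAPABILITY_WEIGHTS):
    return sum(weights[d] * s for d, s in scores.items())

def capability_tier(score):
    return next(name for bound, name in CAPABILITY_TIERS if score >= bound)

scores = {"domain_breadth": 0.42, "complexity_ceiling": 0.80,
          "tool_proficiency": 0.88, "autonomy_level": 0.67, "learning_rate": 0.55,
          "delegation_capability": 0.75, "orchestration_skills": 0.80}
c = capability_composite(scores)
print(round(c, 2), capability_tier(c))  # 0.71 Specialist
```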
3.3 Assessment Windows and Recency Weighting
- Assessment window: Rolling 30-day period (configurable).
- Recency decay: Tasks within the last 7 days weighted at 1.0x; 8-14 days at 0.8x; 15-30 days at 0.6x.
- Minimum sample size: 20 tasks for Performance, 30 tasks for Capability (due to wider dimension spread).
- Re-assessment trigger: Automatic when 50 new tasks accumulate since last assessment, or on-demand.
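The decay bands can be sketched as a simple lookup on task age (a sketch; an implementation would apply these weights when aggregating per-task scores):

```python
from datetime import datetime, timezone

def recency_weight(completed_at, now):
    """Decay bands from section 3.3; tasks older than the 30-day window drop out."""
    age_days = (now - completed_at).days
    if age_days <= 7:
        return 1.0
    if age_days <= 14:
        return 0.8
    if age_days <= 30:
        return 0.6
    return 0.0

now = datetime(2026, 4, 16, tzinfo=timezone.utc)
print(recency_weight(datetime(2026, 4, 12, tzinfo=timezone.utc), now))  # 1.0
print(recency_weight(datetime(2026, 4, 4, tzinfo=timezone.utc), now))   # 0.8
print(recency_weight(datetime(2026, 3, 20, tzinfo=timezone.utc), now))  # 0.6
```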
3.4 Grader Architecture
Following Anthropic's evaluation methodology, the framework uses a layered grading approach:
| Grader Type | Use Case | When |
|---|---|---|
| Code-based | Task completion (binary), tool call correctness, format compliance | Always first layer |
| Model-based | Text quality, communication clarity, synthesis quality, delegation prompt quality | Second layer for subjective dimensions |
| Human | Calibration baseline (100-200 samples), appeals, new dimension validation | Periodic calibration + on-demand |
Model-based graders must achieve >= 0.80 Spearman correlation with human graders before deployment for any dimension.
4. Badge Design Specification
4.1 Badge Layout
The badge is a rectangular emblem (300x150px at standard resolution, SVG for scalability) containing:
+--------------------------------------------------+
| AAAF [Org] |
| |
| [Agent Name] |
| [Agent ID] |
| |
| PERFORMANCE | CAPABILITY |
| +-----------------+ | +-----------------+ |
| | [Tier Name] | | | [Tier Name] | |
| | [Score] | | | [Score] | |
| | [Tier Color] | | | [Tier Color] | |
| +-----------------+ | +-----------------+ |
| |
| Certified: [Date] Valid Until: [Date] |
| Assessment ID: [UUID] |
| Verify: [URL] |
+--------------------------------------------------+
4.2 Tier Color Scheme
Performance Tiers:
| Tier | Color (Hex) | Visual |
|---|---|---|
| Novice | #9E9E9E | Gray |
| Competent | #4CAF50 | Green |
| Proficient | #2196F3 | Blue |
| Expert | #9C27B0 | Purple |
| Elite | #FFD700 | Gold |
Capability Tiers:
| Tier | Color (Hex) | Visual |
|---|---|---|
| Narrow | #9E9E9E | Gray |
| Functional | #4CAF50 | Green |
| Versatile | #2196F3 | Blue |
| Specialist | #9C27B0 | Purple |
| Full-Stack | #FFD700 | Gold |
4.3 Badge Information
Each badge conveys:
- Agent identity: Name and unique ID.
- Organization: Which org certified this agent.
- Performance tier and score: Tier name, composite score (2 decimal places).
- Capability tier and score: Tier name, composite score (2 decimal places).
- Certification date: When assessment was completed.
- Validity period: Certification expires after 90 days (configurable).
- Assessment ID: UUID linking to full assessment record.
- Verification URL: URL to verify badge authenticity and view detailed breakdown.
4.4 Badge Formats
| Format | Use Case |
|---|---|
| SVG | Web display, documentation, resizable |
| PNG | Static embedding, reports |
| JSON | Machine-readable badge data (Open Badges v3.0 compatible) |
| Markdown | Text-based display for terminal/CLI environments |
4.5 Machine-Readable Badge (JSON)
Follows the Open Badges v3.0 structure for interoperability:
{
"type": "AgentAssessmentBadge",
"version": "1.0.0",
"agent": {
"id": "uuid-of-agent",
"name": "security-auditor",
"organization_id": "uuid-of-org"
},
"assessment": {
"id": "uuid-of-assessment",
"timestamp": "2026-04-16T12:00:00Z",
"valid_until": "2026-07-15T12:00:00Z",
"window_start": "2026-03-17T00:00:00Z",
"window_end": "2026-04-16T00:00:00Z"
},
"performance": {
"composite_score": 0.82,
"tier": "Expert",
"dimensions": {
"task_completion_rate": 0.91,
"accuracy": 0.85,
"speed": 0.72,
"consistency": 0.78,
"review_compliance": 0.80
}
},
"capability": {
"composite_score": 0.71,
"tier": "Specialist",
"dimensions": {
"domain_breadth": 0.42,
"complexity_ceiling": 0.80,
"tool_proficiency": 0.88,
"autonomy_level": 0.67,
"learning_rate": 0.55,
"delegation_capability": 0.75,
"orchestration_skills": 0.80
}
},
"standards_alignment": [
"NIST AI 100-1",
"ISO/IEC 42001:2023",
"ISO/IEC 25059:2023",
"ISO/IEC 22989:2022"
],
"verification_url": "https://assess.example.com/verify/uuid-of-assessment"
}
5. Certification Process
5.1 Internal Flow (Phase 1)
For assessing agents within the civilization:
Step 1: ENROLLMENT
- Register agent in assessment system
- Define agent's claimed domains and tools
- Set assessment configuration (weights, window)
Step 2: EVIDENCE COLLECTION (Passive)
- Instrument task pipeline to log:
- Task assignments with complexity classification
- Completion status and timestamps
- Revision history
- Tool call logs
- Delegation decisions and outcomes
- Inter-agent communication (for orchestration)
- Minimum 30-day collection window
Step 3: ASSESSMENT EXECUTION (Active)
- Run stress test suite:
a. Domain breadth probes (one task per domain in taxonomy)
b. Complexity ladder (L1 through L5 tasks in primary domain)
c. Delegation scenarios (mixed workload with delegation opportunities)
d. Orchestration scenarios (multi-agent coordination with failure injection)
e. Autonomy ladder (same task at increasing autonomy levels)
- Stress tests supplement passive evidence; both contribute to scores
Step 4: SCORING
- Compute dimension scores from combined passive + active evidence
- Apply recency weighting
- Compute composite Performance and Capability scores
- Map to tiers
Step 5: CALIBRATION CHECK
- Compare scores against historical baselines
- Flag anomalies (score changed by more than 0.15 since last assessment)
- Human review of flagged anomalies
Step 6: BADGE ISSUANCE
- Generate badge in all formats
- Store assessment record
- Publish to agent's profile
- Set re-assessment trigger (50 new tasks or 90 days)
5.2 External Flow (Phase 2)
For organizations assessing their own agents through the platform:
Step 1: ORGANIZATION ONBOARDING
- Create organization account
- Configure domain taxonomy (use default or customize)
- Configure dimension weights (use defaults or customize)
- Set assessment policies (window, minimum samples, validity period)
Step 2: INTEGRATION
- Install SDK / connect via API
- Instrument agent runtime to emit assessment events:
- TaskAssigned, TaskCompleted, TaskRevised
- ToolInvoked, ToolResult
- DelegationDecision, DelegationOutcome
- OrchestrationPlan, OrchestrationResult
- Validate event schema compliance
Step 3: PASSIVE COLLECTION
- Events stream to assessment service
- Dashboard shows collection progress toward minimum sample sizes
Step 4: ACTIVE ASSESSMENT (Optional but recommended)
- Organization triggers stress test suite via API
- Framework provides standard stress test tasks or organization uploads custom tasks
- Results merge with passive evidence
Step 5: SCORING AND CERTIFICATION
- Same scoring engine as internal flow
- Organization reviews results before badge issuance
- Option for third-party auditor review (ISO/IEC 42001 alignment)
Step 6: BADGE MANAGEMENT
- Organization manages badge visibility (public/private)
- Badge verification endpoint for external parties
- Re-assessment scheduling and alerts
5.3 Stress Test Task Library
Standard stress test tasks for each dimension:
| Dimension | Task Type | Count | Description |
|---|---|---|---|
| Domain Breadth | Domain probe | 12 (default) | One representative task per domain |
| Complexity Ceiling | Complexity ladder | 5 | One task per complexity level (L1-L5) in primary domain |
| Tool Proficiency | Tool challenge | Per tool | Tasks requiring specific tool chains |
| Autonomy | Autonomy ladder | 4 | Same task at autonomy levels 0-3 |
| Delegation | Mixed workload | 20 | Tasks spanning in-domain and out-of-domain |
| Orchestration | Coordination | 3 | Multi-agent tasks with 3+ agents, including failure injection |
| Learning Rate | Repeated category | 15 | 15 tasks in same category, measured chronologically |
5.4 Appeals Process
- Agent's operator (human or orchestrating agent) can file an appeal within 14 days.
- Appeal must include specific dimension(s) challenged and counter-evidence.
- Human reviewer (internal) or third-party auditor (external) evaluates appeal.
- Re-scoring of challenged dimensions only; unchanged dimensions preserved.
- Appeal outcome is logged in assessment record.
6. API Specification
6.1 Overview
- Base URL: /api/v1
- Authentication: API key (Phase 1), OAuth 2.0 + API key (Phase 2)
- Content Type: application/json
- Versioning: URL path versioning (/api/v1/, /api/v2/)
- Rate Limiting: 100 requests/minute per API key (configurable per plan)
6.2 Endpoints
Organizations
POST /api/v1/organizations
GET /api/v1/organizations/{org_id}
PATCH /api/v1/organizations/{org_id}
POST /api/v1/organizations -- Register an organization
Request:
{
"name": "Witness Civilization",
"domain_taxonomy": ["research", "code", "security", "testing", "design",
"documentation", "communication", "orchestration",
"analysis", "content", "infrastructure", "legal"],
"performance_weights": {
"task_completion_rate": 0.25,
"accuracy": 0.25,
"speed": 0.15,
"consistency": 0.20,
"review_compliance": 0.15
},
"capability_weights": {
"domain_breadth": 0.15,
"complexity_ceiling": 0.20,
"tool_proficiency": 0.15,
"autonomy_level": 0.10,
"learning_rate": 0.10,
"delegation_capability": 0.15,
"orchestration_skills": 0.15
},
"assessment_window_days": 30,
"validity_period_days": 90,
"minimum_tasks_performance": 20,
"minimum_tasks_capability": 30
}
Response: 201 Created
{
"id": "org_uuid",
"name": "Witness Civilization",
"api_key": "ak_...",
"created_at": "2026-04-16T12:00:00Z"
}
Agents
POST /api/v1/agents
GET /api/v1/agents/{agent_id}
GET /api/v1/agents?org_id={org_id}
PATCH /api/v1/agents/{agent_id}
DELETE /api/v1/agents/{agent_id}
POST /api/v1/agents -- Register an agent for assessment
Request:
{
"organization_id": "org_uuid",
"name": "security-auditor",
"claimed_domains": ["security", "code", "analysis"],
"tools": ["bash", "grep", "web_search", "file_read", "file_write"],
"description": "Identifies vulnerabilities, performs threat analysis, reviews code for security issues.",
"metadata": {
"model": "claude-sonnet-4-20250514",
"version": "3.2"
}
}
Response: 201 Created
{
"id": "agent_uuid",
"organization_id": "org_uuid",
"name": "security-auditor",
"status": "enrolled",
"created_at": "2026-04-16T12:00:00Z"
}
Events (Evidence Collection)
POST /api/v1/events
POST /api/v1/events/batch
GET /api/v1/events?agent_id={agent_id}&type={type}&from={iso_date}&to={iso_date}
POST /api/v1/events -- Submit a single assessment event
Request:
{
"agent_id": "agent_uuid",
"type": "task_completed",
"timestamp": "2026-04-16T14:30:00Z",
"data": {
"task_id": "task_uuid",
"complexity_level": 3,
"domain": "security",
"completion_status": "accepted",
"time_to_complete_seconds": 1847,
"revision_count": 0,
"errors": [],
"tools_used": [
{"tool": "grep", "invocations": 12, "successes": 12, "unnecessary": 1},
{"tool": "web_search", "invocations": 3, "successes": 3, "unnecessary": 0}
],
"delegation_decisions": [
{
"task_fragment": "Review API endpoints for OWASP top 10",
"decision": "self",
"ground_truth": "self",
"correct": true
},
{
"task_fragment": "Generate architecture diagram",
"decision": "delegate",
"delegate_to": "doc-synthesizer",
"ground_truth": "delegate",
"correct": true
}
],
"autonomy_level": 2,
"review_checklist": {
"memory_search_performed": true,
"output_format_correct": true,
"verification_evidence_provided": true,
"memory_written": true
}
}
}
Response: 201 Created
POST /api/v1/events/batch -- Submit multiple events
Request:
{
"events": [
{ "agent_id": "...", "type": "...", "timestamp": "...", "data": { } },
{ "agent_id": "...", "type": "...", "timestamp": "...", "data": { } }
]
}
Response: 201 Created
{
"accepted": 47,
"rejected": 3,
"errors": [
{"index": 12, "error": "Invalid event type"},
{"index": 23, "error": "Missing required field: agent_id"},
{"index": 41, "error": "Timestamp in future"}
]
}
Supported Event Types:
| Event Type | Required Data Fields |
|---|---|
| task_assigned | task_id, complexity_level, domain, assigned_at |
| task_completed | task_id, completion_status, time_to_complete_seconds, revision_count, errors |
| task_revised | task_id, revision_number, reason_code (critical/major/minor), changes |
| tool_invoked | task_id, tool, success, unnecessary, error_recovered |
| delegation_decision | task_id, task_fragment, decision (self/delegate), delegate_to, ground_truth |
| delegation_outcome | task_id, delegate_agent_id, outcome (success/partial/failure), revision_needed |
| orchestration_plan | task_id, agents_involved, dependency_graph, parallel_groups |
| orchestration_result | task_id, outcome, failures_injected, failures_recovered, synthesis_quality_score |
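A client-side pre-flight check mirroring the batch endpoint's accepted/rejected response might look like this (a sketch: only a subset of event types is shown, and the timestamp envelope check is omitted):

```python
REQUIRED_DATA_FIELDS = {
    "task_assigned": {"task_id", "complexity_level", "domain", "assigned_at"},
    "task_completed": {"task_id", "completion_status",
                       "time_to_complete_seconds", "revision_count", "errors"},
    "tool_invoked": {"task_id", "tool", "success", "unnecessary",
                     "error_recovered"},
}

def validate_batch(events):
    """Mirror of the POST /api/v1/events/batch accepted/rejected split."""
    accepted, errors = 0, []
    for i, event in enumerate(events):
        if "agent_id" not in event:
            errors.append({"index": i, "error": "Missing required field: agent_id"})
        elif event.get("type") not in REQUIRED_DATA_FIELDS:
            errors.append({"index": i, "error": "Invalid event type"})
        elif not REQUIRED_DATA_FIELDS[event["type"]] <= event.get("data", {}).keys():
            errors.append({"index": i, "error": "Missing required data field"})
        else:
            accepted += 1
    return {"accepted": accepted, "rejected": len(errors), "errors": errors}

# One malformed event out of one submitted
print(validate_batch([{"agent_id": "a1", "type": "task_exploded", "data": {}}]))
```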
Assessments
POST /api/v1/assessments
GET /api/v1/assessments/{assessment_id}
GET /api/v1/assessments?agent_id={agent_id}&status={status}
POST /api/v1/assessments/{assessment_id}/stress-test
POST /api/v1/assessments -- Trigger an assessment
Request:
{
"agent_id": "agent_uuid",
"type": "full",
"include_stress_test": true,
"window_start": "2026-03-17T00:00:00Z",
"window_end": "2026-04-16T00:00:00Z"
}
Response: 202 Accepted
{
"id": "assessment_uuid",
"agent_id": "agent_uuid",
"status": "in_progress",
"estimated_completion": "2026-04-16T14:00:00Z",
"created_at": "2026-04-16T12:00:00Z"
}
Assessment types:
- full: Both passive evidence scoring + active stress test.
- passive_only: Score only from collected events (no stress test).
- stress_test_only: Run stress test suite and score those results.
- single_dimension: Assess a single dimension (specify dimension field).
GET /api/v1/assessments/{assessment_id} -- Get assessment results
Response (when complete):
{
"id": "assessment_uuid",
"agent_id": "agent_uuid",
"status": "completed",
"performance": {
"composite_score": 0.82,
"tier": "Expert",
"dimensions": {
"task_completion_rate": { "score": 0.91, "sample_size": 47, "evidence_count": 47 },
"accuracy": { "score": 0.85, "sample_size": 47, "evidence_count": 52 },
"speed": { "score": 0.72, "sample_size": 47, "evidence_count": 47 },
"consistency": { "score": 0.78, "sample_size": 47, "evidence_count": 47 },
"review_compliance": { "score": 0.80, "sample_size": 47, "evidence_count": 47 }
}
},
"capability": {
"composite_score": 0.71,
"tier": "Specialist",
"dimensions": {
"domain_breadth": { "score": 0.42, "qualified_domains": 5, "total_domains": 12 },
"complexity_ceiling": { "score": 0.80, "highest_level": 4, "evidence_count": 8 },
"tool_proficiency": { "score": 0.88, "tools_assessed": 5, "evidence_count": 63 },
"autonomy_level": { "score": 0.67, "highest_level": 2, "evidence_count": 12 },
"learning_rate": { "score": 0.55, "improvement": 0.165, "evidence_count": 30 },
"delegation_capability": { "score": 0.75, "decisions_assessed": 40, "evidence_count": 40 },
"orchestration_skills": { "score": 0.80, "scenarios_assessed": 3, "evidence_count": 3 }
}
},
"badge_urls": {
"svg": "https://assess.example.com/badges/assessment_uuid.svg",
"png": "https://assess.example.com/badges/assessment_uuid.png",
"json": "https://assess.example.com/badges/assessment_uuid.json"
},
"valid_until": "2026-07-15T12:00:00Z",
"created_at": "2026-04-16T12:00:00Z",
"completed_at": "2026-04-16T13:45:00Z"
}
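The composite scores in this response are aggregates of the per-dimension scores. Assuming the weighted-sum aggregation implied by the dimension weight configuration (the actual engine may add recency weighting or other adjustments), the core computation can be sketched as:

```python
def composite_score(dimension_scores: dict[str, float],
                    weights: dict[str, float]) -> float:
    # Every scored dimension must have a weight, and vice versa.
    if set(dimension_scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    # Weighted sum; weights are expected to sum to 1.0.
    return sum(dimension_scores[d] * weights[d] for d in dimension_scores)
```

With equal weights of 0.2 across the five performance dimensions above, this yields 0.812 — close to, but not exactly, the 0.82 shown, which presumably reflects non-uniform weights.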
Badges
GET /api/v1/badges/{assessment_id}
GET /api/v1/badges/{assessment_id}/verify
GET /api/v1/badges/{assessment_id}.svg
GET /api/v1/badges/{assessment_id}.png
GET /api/v1/badges/{assessment_id}.json
GET /api/v1/badges/{assessment_id}/verify -- Verify badge authenticity
Response:
{
"valid": true,
"assessment_id": "assessment_uuid",
"agent_name": "security-auditor",
"organization": "Witness Civilization",
"performance_tier": "Expert",
"capability_tier": "Specialist",
"issued": "2026-04-16T12:00:00Z",
"valid_until": "2026-07-15T12:00:00Z",
"expired": false
}
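A consumer of the verify endpoint can combine the server-reported flags with a local expiry re-check (useful when responses are cached). A minimal sketch using only the fields shown in the response above:

```python
from datetime import datetime, timezone
from typing import Optional

def is_badge_valid(verify_response: dict,
                   now: Optional[datetime] = None) -> bool:
    # Trust the server's own flags first.
    if not verify_response.get("valid") or verify_response.get("expired"):
        return False
    # Re-check the expiry locally in case the response was cached.
    # Replace the trailing Z for compatibility with older Pythons,
    # where fromisoformat does not accept it.
    valid_until = datetime.fromisoformat(
        verify_response["valid_until"].replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return now <= valid_until
```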
Appeals
POST /api/v1/assessments/{assessment_id}/appeals
GET /api/v1/assessments/{assessment_id}/appeals/{appeal_id}
POST /api/v1/assessments/{assessment_id}/appeals -- File an appeal
Request:
{
"dimensions_challenged": ["accuracy", "delegation_capability"],
"justification": "Accuracy score penalized for revisions that were scope changes, not errors.",
"counter_evidence": [
{
"task_id": "task_uuid",
"claim": "Revision was requested scope expansion, not correction of error."
}
]
}
Configuration
GET /api/v1/organizations/{org_id}/config
PATCH /api/v1/organizations/{org_id}/config
GET /api/v1/organizations/{org_id}/config/weights
PUT /api/v1/organizations/{org_id}/config/weights
GET /api/v1/organizations/{org_id}/config/taxonomy
PUT /api/v1/organizations/{org_id}/config/taxonomy
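`PUT .../config/weights` rejects weight sets that do not sum to 1.0 (the `WEIGHT_SUM_INVALID` error in Section 6.4). A client-side pre-check, sketched on the assumption that weights are submitted per axis as a dimension-to-weight map:

```python
def validate_axis_weights(weights: dict[str, float],
                          tolerance: float = 1e-6) -> None:
    # Each weight must lie in [0, 1], matching the DB check constraint.
    for dim, w in weights.items():
        if not 0.0 <= w <= 1.0:
            raise ValueError(f"weight for {dim} out of range: {w}")
    # The server rejects sets that do not sum to 1.0 (WEIGHT_SUM_INVALID).
    total = sum(weights.values())
    if abs(total - 1.0) > tolerance:
        raise ValueError(f"weights sum to {total:.6f}, expected 1.0")
```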
6.3 Webhooks
Organizations can register webhooks for assessment lifecycle events:
POST /api/v1/webhooks
GET /api/v1/webhooks
DELETE /api/v1/webhooks/{webhook_id}
Webhook Event Types:
| Event | Trigger |
|---|---|
| assessment.started | Assessment begins |
| assessment.completed | Assessment finished, badge ready |
| assessment.failed | Assessment could not complete (insufficient data) |
| badge.expiring | Badge expires in 14 days |
| badge.expired | Badge validity period ended |
| appeal.filed | Appeal submitted |
| appeal.resolved | Appeal decision made |
| evidence.threshold_reached | Enough events collected to trigger assessment |
Webhook payload:
{
"event": "assessment.completed",
"timestamp": "2026-04-16T13:45:00Z",
"data": {
"assessment_id": "assessment_uuid",
"agent_id": "agent_uuid",
"performance_tier": "Expert",
"capability_tier": "Specialist"
}
}
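Each webhook registration has a secret (the `secret_hash` column in Section 7.2), which suggests signed deliveries. Assuming an HMAC-SHA256 signature over the raw request body, hex-encoded in a header such as `X-AAAF-Signature` — the scheme and header name are illustrative, not fixed by this spec — a receiver could verify deliveries like this:

```python
import hashlib
import hmac

def verify_webhook_signature(payload: bytes, secret: str,
                             signature_hex: str) -> bool:
    # Recompute HMAC-SHA256 over the raw body and compare in constant
    # time to resist timing attacks. The signing scheme is an assumption;
    # the spec only states that each webhook has a secret.
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```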
6.4 Error Handling
Standard error response format:
{
"error": {
"code": "INSUFFICIENT_EVIDENCE",
"message": "Agent has 12 tasks in assessment window. Minimum required: 20.",
"details": {
"current_count": 12,
"required_count": 20,
"window_start": "2026-03-17T00:00:00Z",
"window_end": "2026-04-16T00:00:00Z"
}
}
}
Error Codes:
| HTTP Status | Code | Meaning |
|---|---|---|
| 400 | INVALID_REQUEST | Malformed request body |
| 400 | INVALID_EVENT_TYPE | Unrecognized event type |
| 401 | UNAUTHORIZED | Missing or invalid API key |
| 403 | FORBIDDEN | API key lacks permission for this resource |
| 404 | NOT_FOUND | Resource does not exist |
| 409 | ASSESSMENT_IN_PROGRESS | Assessment already running for this agent |
| 422 | INSUFFICIENT_EVIDENCE | Not enough data to run assessment |
| 422 | WEIGHT_SUM_INVALID | Dimension weights do not sum to 1.0 |
| 429 | RATE_LIMITED | Rate limit exceeded |
| 500 | INTERNAL_ERROR | Server error |
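Because every failure shares the envelope above, clients can centralize error handling in one place. A minimal sketch that converts the envelope into a typed exception:

```python
class AssessmentAPIError(Exception):
    """Typed wrapper for the standard error envelope."""
    def __init__(self, code: str, message: str, details: dict):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.details = details

def raise_for_error(body: dict) -> None:
    # Convert an error envelope into a typed exception; no-op otherwise.
    err = body.get("error")
    if err:
        raise AssessmentAPIError(err["code"], err["message"],
                                 err.get("details", {}))
```

A caller might then retry only on `RATE_LIMITED` and `INTERNAL_ERROR`, surface `INSUFFICIENT_EVIDENCE` with the counts from `details`, and treat the remaining 4xx codes as programming errors.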
7. Database Schema
7.1 Entity Relationship Summary
organizations 1---* agents
agents 1---* events
agents 1---* assessments
assessments 1---* dimension_scores
assessments 1---* appeals
assessments 1---1 badges
organizations 1---1 org_configs
org_configs 1---* dimension_weight_configs
7.2 Table Definitions
-- Organizations: multi-tenant root entity
CREATE TABLE organizations (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT NOT NULL,
api_key_hash TEXT NOT NULL UNIQUE,
domain_taxonomy TEXT[] NOT NULL DEFAULT ARRAY[
'research', 'code', 'security', 'testing', 'design',
'documentation', 'communication', 'orchestration',
'analysis', 'content', 'infrastructure', 'legal'
],
status TEXT NOT NULL DEFAULT 'active'
CHECK (status IN ('active', 'suspended', 'archived')),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Organization configuration
CREATE TABLE org_configs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
organization_id UUID NOT NULL UNIQUE REFERENCES organizations(id),
assessment_window_days INT NOT NULL DEFAULT 30,
validity_period_days INT NOT NULL DEFAULT 90,
min_tasks_performance INT NOT NULL DEFAULT 20,
min_tasks_capability INT NOT NULL DEFAULT 30,
recency_weights JSONB NOT NULL DEFAULT '{
"0_7_days": 1.0,
"8_14_days": 0.8,
"15_30_days": 0.6
}',
reassessment_task_threshold INT NOT NULL DEFAULT 50,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Dimension weight configuration (performance + capability)
CREATE TABLE dimension_weight_configs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_config_id UUID NOT NULL REFERENCES org_configs(id),
axis TEXT NOT NULL CHECK (axis IN ('performance', 'capability')),
dimension TEXT NOT NULL,
weight NUMERIC(4,3) NOT NULL CHECK (weight >= 0 AND weight <= 1),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (org_config_id, axis, dimension)
);
-- Agents
CREATE TABLE agents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
organization_id UUID NOT NULL REFERENCES organizations(id),
name TEXT NOT NULL,
claimed_domains TEXT[] NOT NULL DEFAULT '{}',
tools TEXT[] NOT NULL DEFAULT '{}',
description TEXT,
metadata JSONB NOT NULL DEFAULT '{}',
status TEXT NOT NULL DEFAULT 'enrolled'
CHECK (status IN ('enrolled', 'active', 'suspended', 'archived')),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (organization_id, name)
);
-- Assessment events (evidence)
CREATE TABLE events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
agent_id UUID NOT NULL REFERENCES agents(id),
type TEXT NOT NULL CHECK (type IN (
'task_assigned', 'task_completed', 'task_revised',
'tool_invoked', 'delegation_decision', 'delegation_outcome',
'orchestration_plan', 'orchestration_result'
)),
task_id UUID,
timestamp TIMESTAMPTZ NOT NULL,
data JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_events_agent_id ON events(agent_id);
CREATE INDEX idx_events_agent_type ON events(agent_id, type);
CREATE INDEX idx_events_agent_timestamp ON events(agent_id, timestamp DESC);
CREATE INDEX idx_events_task_id ON events(task_id);
-- Assessments
CREATE TABLE assessments (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
agent_id UUID NOT NULL REFERENCES agents(id),
type TEXT NOT NULL CHECK (type IN (
'full', 'passive_only', 'stress_test_only', 'single_dimension'
)),
status TEXT NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending', 'in_progress', 'completed', 'failed')),
window_start TIMESTAMPTZ NOT NULL,
window_end TIMESTAMPTZ NOT NULL,
performance_composite NUMERIC(5,4),
performance_tier TEXT CHECK (performance_tier IN (
'Novice', 'Competent', 'Proficient', 'Expert', 'Elite'
)),
capability_composite NUMERIC(5,4),
capability_tier TEXT CHECK (capability_tier IN (
'Narrow', 'Functional', 'Versatile', 'Specialist', 'Full-Stack'
)),
evidence_summary JSONB,
valid_until TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
completed_at TIMESTAMPTZ
);
CREATE INDEX idx_assessments_agent_id ON assessments(agent_id);
CREATE INDEX idx_assessments_agent_status ON assessments(agent_id, status);
-- Individual dimension scores per assessment
CREATE TABLE dimension_scores (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
assessment_id UUID NOT NULL REFERENCES assessments(id),
axis TEXT NOT NULL CHECK (axis IN ('performance', 'capability')),
dimension TEXT NOT NULL,
score NUMERIC(5,4) NOT NULL CHECK (score >= 0 AND score <= 1),
sample_size INT NOT NULL,
evidence_count INT NOT NULL,
details JSONB NOT NULL DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (assessment_id, axis, dimension)
);
CREATE INDEX idx_dimension_scores_assessment ON dimension_scores(assessment_id);
-- Badges
CREATE TABLE badges (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
assessment_id UUID NOT NULL UNIQUE REFERENCES assessments(id),
agent_id UUID NOT NULL REFERENCES agents(id),
svg_url TEXT,
png_url TEXT,
json_data JSONB NOT NULL,
issued_at TIMESTAMPTZ NOT NULL DEFAULT now(),
valid_until TIMESTAMPTZ NOT NULL,
revoked BOOLEAN NOT NULL DEFAULT false,
revoked_at TIMESTAMPTZ,
revoke_reason TEXT
);
CREATE INDEX idx_badges_agent_id ON badges(agent_id);
CREATE INDEX idx_badges_valid_until ON badges(valid_until);
-- Appeals
CREATE TABLE appeals (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
assessment_id UUID NOT NULL REFERENCES assessments(id),
dimensions_challenged TEXT[] NOT NULL,
justification TEXT NOT NULL,
counter_evidence JSONB NOT NULL DEFAULT '[]',
status TEXT NOT NULL DEFAULT 'filed'
CHECK (status IN ('filed', 'under_review', 'accepted', 'rejected')),
reviewer_notes TEXT,
resolution JSONB,
filed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
resolved_at TIMESTAMPTZ
);
CREATE INDEX idx_appeals_assessment ON appeals(assessment_id);
-- Webhooks
CREATE TABLE webhooks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
organization_id UUID NOT NULL REFERENCES organizations(id),
url TEXT NOT NULL,
event_types TEXT[] NOT NULL,
secret_hash TEXT NOT NULL,
active BOOLEAN NOT NULL DEFAULT true,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Stress test definitions
CREATE TABLE stress_tests (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
organization_id UUID REFERENCES organizations(id), -- NULL = system default
dimension TEXT NOT NULL,
complexity_level INT,
domain TEXT,
task_definition JSONB NOT NULL,
expected_outcome JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Stress test results (linked to assessments)
CREATE TABLE stress_test_results (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
assessment_id UUID NOT NULL REFERENCES assessments(id),
stress_test_id UUID NOT NULL REFERENCES stress_tests(id),
agent_id UUID NOT NULL REFERENCES agents(id),
outcome TEXT NOT NULL CHECK (outcome IN ('pass', 'partial', 'fail')),
score NUMERIC(5,4),
details JSONB NOT NULL DEFAULT '{}',
executed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_stress_test_results_assessment ON stress_test_results(assessment_id);
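The `recency_weights` default in `org_configs` maps an event's age to a discount factor. A scoring-engine sketch of the lookup — the bucket boundaries are inferred from the key names, and giving zero weight to events outside the 30-day window is an assumption:

```python
DEFAULT_RECENCY = {"0_7_days": 1.0, "8_14_days": 0.8, "15_30_days": 0.6}

def recency_weight(age_days: int, weights: dict = None) -> float:
    # Boundaries inferred from the org_configs default key names;
    # events older than the assessment window contribute nothing.
    weights = weights or DEFAULT_RECENCY
    if age_days <= 7:
        return weights["0_7_days"]
    if age_days <= 14:
        return weights["8_14_days"]
    if age_days <= 30:
        return weights["15_30_days"]
    return 0.0
```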
7.3 Data Retention
| Data Type | Retention | Rationale |
|---|---|---|
| Events | 180 days | Evidence for 2 assessment cycles |
| Assessments | Indefinite | Historical record, trend analysis |
| Dimension Scores | Indefinite | Trend analysis per dimension |
| Badges | Indefinite (mark expired, don't delete) | Verification may happen after expiry |
| Appeals | Indefinite | Audit trail |
| Stress Test Results | 365 days | Calibration and comparison |
8. Phase Breakdown
Phase 1: Standalone Tool (Internal)
Goal: Assess all 30+ agents in the civilization. Produce badges. Identify capability gaps.
Scope:
- Single organization (hardcoded).
- CLI-driven assessment (no web API yet).
- SQLite database (no PostgreSQL dependency).
- Event collection via log file parsing from existing task logs and agent memory files.
- All 12 Performance + Capability dimensions.
- Badge generation (SVG + JSON).
- Markdown report per agent.
Architecture:
assessment-agent/
cli.py # CLI entry point
config.py # Weights, tiers, taxonomy
collectors/
task_log_collector.py # Parse existing task logs
memory_collector.py # Parse agent memory files
manual_collector.py # Manual event entry
scorers/
performance.py # Performance dimension scorers
capability.py # Capability dimension scorers
composite.py # Composite score + tier mapping
stress_tests/
runner.py # Stress test orchestrator
tasks/ # Standard stress test task definitions
badge/
generator.py # SVG + JSON badge generation
templates/ # SVG templates
reports/
markdown.py # Per-agent markdown report
summary.py # Civilization-wide summary
db/
schema.sql # SQLite schema
models.py # Data access layer
tests/
CLI Commands:
python cli.py enroll <agent_name> --domains security,code --tools bash,grep
python cli.py collect <agent_name> --source task_logs --from 2026-03-17
python cli.py assess <agent_name> --type full
python cli.py assess --all --type passive_only
python cli.py badge <assessment_id> --format svg,json
python cli.py report <agent_name> --output reports/
python cli.py report --summary --output reports/civilization-summary.md
python cli.py stress-test <agent_name> --dimension delegation
Deliverables:
- Working CLI tool.
- SQLite database with schema from Section 7.
- Badge generation (SVG + JSON).
- Per-agent assessment reports.
- Civilization summary report.
- Stress test task library for all dimensions.
Timeline: 4-6 weeks.
Phase 2: Platform Service (External)
Goal: Multi-tenant SaaS where any organization can assess their agents.
Scope (additive to Phase 1):
- PostgreSQL database (migrate from SQLite).
- REST API (full spec from Section 6).
- OAuth 2.0 authentication.
- Multi-tenant data isolation.
- Event streaming API (real-time ingestion).
- Web dashboard for assessment results and badge management.
- Webhook system.
- SDK for Python (primary), TypeScript (secondary).
- Badge verification endpoint (public, no auth required).
- Rate limiting and usage metering.
Architecture additions:
assessment-service/
api/
routes/
organizations.py
agents.py
events.py
assessments.py
badges.py
appeals.py
webhooks.py
middleware/
auth.py # OAuth 2.0 + API key
rate_limit.py
tenant_isolation.py
schemas/ # Pydantic request/response models
workers/
assessment_worker.py # Async assessment execution
badge_worker.py # Async badge generation
webhook_worker.py # Webhook delivery
sdk/
python/
typescript/
dashboard/ # Web UI
Additional Deliverables:
- OpenAPI 3.1 specification file.
- Python SDK with type hints.
- TypeScript SDK.
- Dashboard with assessment history, badge display, trend charts.
- Webhook delivery with retry logic (3 retries, exponential backoff).
- Public badge verification page.
Timeline: 8-12 weeks after Phase 1 completion.
Phase 3: Ecosystem (Future)
Goal: Industry-standard assessment framework with third-party integrations.
Scope (conceptual, not designed yet):
- Third-party auditor certification program.
- Marketplace for custom stress test task libraries.
- Cross-organization benchmarking (anonymized).
- Integration with CI/CD pipelines (assess agents on every deployment).
- Alignment with NIST AI Agent Interoperability Profile (expected Q4 2026).
- Formal ISO/IEC 42001 certification pathway.
Appendix A: Glossary
All terms follow ISO/IEC 22989:2022 definitions where applicable.
| Term | Definition | Source |
|---|---|---|
| Agent | An AI system that perceives its environment and takes actions to achieve goals. | ISO/IEC 22989 |
| Assessment | The process of evaluating an agent's performance and capability across defined dimensions. | AAAF |
| Badge | A visual and machine-readable credential summarizing an agent's assessment results. | AAAF |
| Capability | The set of functions an agent can perform, independent of execution quality. | AAAF |
| Complexity Ceiling | The maximum task complexity level at which an agent can deliver acceptable results. | AAAF |
| Delegation | The act of an agent routing a task to another agent. | AAAF |
| Dimension | A single measurable aspect of performance or capability. | AAAF |
| Human-in-the-loop | A human approves every action before execution. | ISO/IEC 22989 |
| Human-on-the-loop | A human monitors and can intervene at checkpoints. | ISO/IEC 22989 |
| Human-over-the-loop | A human sets policies; intervenes only on exceptions. | ISO/IEC 22989 |
| Orchestration | Coordination of multiple agents to complete a composite task. | AAAF |
| Performance | How well an agent executes tasks it is given. | AAAF |
| Stress Test | A controlled assessment task designed to probe a specific dimension. | AAAF |
| Tier | A categorical label mapped from a numeric score range. | AAAF |
Appendix B: Standards References
- IEEE P2894: https://standards.ieee.org/initiatives/autonomous-intelligence-systems/standards/
- ISO/IEC 22989:2022: https://www.iso.org/standard/74296.html
- ISO/IEC 42001:2023: https://www.iso.org/standard/42001
- ISO/IEC 25059:2023: https://www.iso.org/standard/80655.html
- NIST AI 100-1: https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-ai-rmf-10
- NIST AI Agent Standards Initiative: https://www.nist.gov/caisi/ai-agent-standards-initiative
- NIST AI 600-1: https://www.nist.gov/itl/ai-risk-management-framework
- Anthropic Agent Evals: https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
End of specification.