AI Agent Assessment Framework -- Complete Specification
Agent: api-architect
Domain: API Design, Assessment Architecture, Standards Alignment
Date: 2026-04-16
Table of Contents
- Executive Summary
- Standards Alignment Matrix
- Assessment Methodology
- Badge Design Specification
- Certification Process
- API Specification
- Database Schema
- Phase Breakdown
1. Executive Summary
The AI Agent Assessment Framework (AAAF) is a dual-axis scoring and certification system for AI agents. It produces a single badge displaying two independent scores:
- Performance Score: How well an agent executes (task completion, accuracy, speed, consistency, compliance).
- Capability Score: What an agent can do (domain breadth, complexity ceiling, tool proficiency, autonomy, learning rate, delegation, orchestration).
The system is designed for internal use within a multi-agent civilization (30+ agents) from day one, with architecture that supports external adoption by any organization running AI agents.
Design Principles
- Dual-axis independence: Performance and Capability are orthogonal. An agent can be a narrow expert with elite performance, or a versatile generalist with competent performance. Neither axis dominates.
- Evidence-based scoring: Every score must trace to observable, reproducible evidence. No subjective impressions without calibration.
- Standards-aligned: Maps to IEEE, ISO/IEC, and NIST frameworks from the start, not retrofitted later.
- Multi-tenant from day one: The data model, API, and scoring engine treat "organization" as a first-class concept, even when the only organization is the internal civilization.
2. Standards Alignment Matrix
Referenced Standards
| Standard | Full Title | Relevance |
|---|---|---|
| IEEE P2894 | Standard for AI Agent Interoperability | Agent capability description semantics; interoperability levels |
| ISO/IEC 22989:2022 | AI Concepts and Terminology | Canonical vocabulary for agent, system, model, transparency, explainability |
| ISO/IEC 42001:2023 | AI Management Systems | PDCA lifecycle, 38 controls, risk assessment, governance requirements |
| ISO/IEC 25059:2023 | Quality Model for AI Systems (SQuaRE) | Quality characteristics: accuracy, robustness, fairness, interpretability |
| NIST AI 100-1 | AI Risk Management Framework 1.0 | GOVERN/MAP/MEASURE/MANAGE functions; trustworthy AI characteristics |
| NIST AI Agent Standards Initiative (2026) | AI Agent Standards Initiative | Agent authentication, identity, security evaluation, interoperability profile |
| NIST AI 600-1 | Generative AI Profile | GenAI-specific risk categories and mitigations |
Dimension-to-Standard Mapping
| Assessment Dimension | Primary Standard | Specific Clause/Function |
|---|---|---|
| Performance: Task Completion Rate | NIST AI 100-1 | MEASURE 2.6 (valid and reliable) |
| Performance: Accuracy | ISO/IEC 25059 | Accuracy quality characteristic |
| Performance: Speed to Delivery | ISO/IEC 25059 | Time behavior sub-characteristic (performance efficiency) |
| Performance: Consistency | NIST AI 100-1 | MEASURE 2.6 (reliability across contexts) |
| Performance: Review Compliance | ISO/IEC 42001 | Control A.6.2.6 (monitoring), PDCA Act phase |
| Capability: Domain Breadth | IEEE P2894 | Capability description semantics |
| Capability: Complexity Ceiling | ISO/IEC 25059 | Functional suitability, completeness |
| Capability: Tool Proficiency | NIST Agent Initiative | Agent-tool interaction security/identity |
| Capability: Autonomy Level | ISO/IEC 22989 | Human-in/on/over-the-loop definitions |
| Capability: Learning Rate | ISO/IEC 25059 | Adaptability, continuous learning quality |
| Capability: Delegation | NIST Agent Initiative | Agent-to-agent communication, authorization |
| Capability: Orchestration | NIST Agent Initiative + IEEE P2894 | Multi-agent coordination, interoperability levels |
Compliance Checkpoints
The framework maps to ISO/IEC 42001 PDCA phases as follows:
| PDCA Phase | AAAF Activity |
|---|---|
| Plan | Define assessment criteria, select dimensions, weight configuration |
| Do | Execute assessment tasks, collect evidence, run scoring |
| Check | Validate scores against calibration baselines, compare to historical |
| Act | Issue badge, identify improvement areas, trigger re-assessment |
The framework maps to NIST AI RMF functions:
| NIST Function | AAAF Activity |
|---|---|
| GOVERN | Organization-level assessment policies, tier definitions, weight configs |
| MAP | Agent context mapping (domain, tools, autonomy level) |
| MEASURE | Evidence collection, scoring, grading |
| MANAGE | Badge issuance, certification lifecycle, re-assessment triggers |
3. Assessment Methodology
3.1 Performance Scoring
Performance measures how well an agent executes tasks it is given. All performance dimensions produce a normalized score from 0.0 to 1.0.
Dimension: Task Completion Rate (Weight: 25%)
Definition: Proportion of assigned tasks completed to acceptance criteria without human intervention.
Measurement Method:
- Track all tasks assigned to the agent over the assessment window.
- A task is "complete" when it meets its defined acceptance criteria (not merely when the agent reports it done).
- Partial completion receives fractional credit using a milestone-based formula.
Scoring Formula:
TCR = (fully_completed * 1.0 + partially_completed * milestone_fraction) / total_assigned
Evidence Required:
- Task log with assignment timestamp, completion timestamp, acceptance status.
- For partial completions: milestone checklist with individual milestone status.
Calibration: Baseline established from first 50 assessed tasks. Score is absolute, not curved.
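The formula above can be sketched in Python (function and argument names are illustrative, not part of the spec):

```python
def task_completion_rate(fully_completed, partial_milestones, total_assigned):
    """Sketch of the TCR formula. `partial_milestones` holds one
    (milestones_done, milestones_total) pair per partially completed task."""
    if total_assigned == 0:
        return 0.0
    # Each partial completion earns its milestone fraction as credit.
    partial_credit = sum(done / total for done, total in partial_milestones)
    return (fully_completed + partial_credit) / total_assigned

# 8 tasks fully completed, one task at 2 of 4 milestones, 10 assigned in the window
print(task_completion_rate(8, [(2, 4)], 10))  # 0.85
```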
Dimension: Accuracy (Weight: 25%)
Definition: Correctness of agent output, measured as inverse of revision rate and error density.
Measurement Method:
- Revision Rate: Proportion of completed tasks requiring revision after initial submission.
- Error Density: Number of substantive errors per 1000 output tokens (for text/code) or per task (for non-text outputs).
- Severity Weighting: Critical errors (functional breakage) weighted 3x. Minor errors (style, formatting) weighted 0.5x.
Scoring Formula:
accuracy_from_revisions = max(0, 1.0 - (revision_count / completed_task_count)) // floored at 0 when tasks need multiple revisions
accuracy_from_errors = 1.0 - min(1.0, weighted_error_count / baseline_threshold)
Accuracy = 0.6 * accuracy_from_revisions + 0.4 * accuracy_from_errors
Evidence Required:
- Revision log with reason codes (critical/major/minor).
- Error annotations on sampled outputs (minimum 20% sample rate).
Grading Method: Code-based graders for code output (test suites, linters). Model-based graders with calibrated rubrics for text output (per Anthropic's recommendation of 0.80+ Spearman correlation with human evaluators).
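A sketch of the two-part Accuracy formula with the 3x/0.5x severity weights; the `baseline_threshold` default here is an illustrative placeholder that an implementation would set during calibration:

```python
SEVERITY_WEIGHTS = {"critical": 3.0, "major": 1.0, "minor": 0.5}

def accuracy_score(revision_count, completed_tasks, error_severities,
                   baseline_threshold=10.0):
    """error_severities: severity labels for errors found in sampled output."""
    from_revisions = max(0.0, 1.0 - revision_count / completed_tasks)
    weighted = sum(SEVERITY_WEIGHTS[s] for s in error_severities)
    from_errors = 1.0 - min(1.0, weighted / baseline_threshold)
    return 0.6 * from_revisions + 0.4 * from_errors

# 5 revisions over 50 tasks; one critical and two minor errors in the sample
print(accuracy_score(5, 50, ["critical", "minor", "minor"]))  # ≈ 0.78
```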
Dimension: Speed to Delivery (Weight: 15%)
Definition: Time from task assignment to acceptable completion, relative to task complexity.
Measurement Method:
- Measure wall-clock time from assignment to acceptance.
- Normalize by estimated complexity (story points or t-shirt size).
- Score against complexity-specific baselines.
Scoring Formula:
normalized_time = actual_time / baseline_time_for_complexity
Speed = min(1.0, max(0, 1.0 - (normalized_time - 1.0))) // 1.0 at or below baseline, 0.0 at 2x baseline
Evidence Required:
- Timestamp pairs (assigned, completed) per task.
- Complexity classification per task.
- Baseline lookup table per complexity class.
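A sketch of the speed formula, with the result clamped to the 0.0-1.0 range so faster-than-baseline delivery caps at 1.0 (names are illustrative):

```python
def speed_score(actual_seconds, baseline_seconds):
    """1.0 at or under the complexity baseline, linear decay to 0.0 at 2x."""
    normalized = actual_seconds / baseline_seconds
    return max(0.0, min(1.0, 1.0 - (normalized - 1.0)))

print(speed_score(1800, 1800))  # 1.0  (on baseline)
print(speed_score(2700, 1800))  # 0.5  (1.5x baseline)
print(speed_score(4000, 1800))  # 0.0  (past 2x baseline)
```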
Dimension: Consistency (Weight: 20%)
Definition: Variance in performance across repeated similar tasks.
Measurement Method:
- Group tasks by category/complexity.
- Measure coefficient of variation (CV) of accuracy and completion rate within each group.
- Lower CV equals higher consistency.
Scoring Formula:
cv_accuracy = std(accuracy_per_task_group) / mean(accuracy_per_task_group)
cv_completion = std(completion_per_task_group) / mean(completion_per_task_group)
Consistency = 1.0 - min(1.0, (0.6 * cv_accuracy + 0.4 * cv_completion))
Alignment: Maps to NIST AI RMF "reliable" characteristic -- consistent behavior across varied contexts.
Evidence Required:
- Per-task scores grouped by category.
- Minimum 5 tasks per category for statistical validity.
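Using the standard library, the CV-based formula might look like this. How per-group CVs are aggregated across categories is left to the implementation, so this sketch scores a single task group:

```python
from statistics import mean, stdev

def cv(values):
    """Coefficient of variation: relative spread of per-task scores."""
    return stdev(values) / mean(values)

def consistency_score(accuracy_scores, completion_scores):
    """Per-task scores within one task group (>= 5 tasks per group)."""
    penalty = 0.6 * cv(accuracy_scores) + 0.4 * cv(completion_scores)
    return 1.0 - min(1.0, penalty)

# Perfectly stable agent: zero variance, consistency 1.0
print(consistency_score([0.8] * 5, [1.0] * 5))  # 1.0
```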
Dimension: Review Compliance (Weight: 15%)
Definition: Adherence to organizational review processes, output format standards, and verification protocols.
Measurement Method:
- Binary checklist per task: Did the agent follow the required process?
- Checklist items are configurable per organization. Internal default items:
- Memory search performed before work (yes/no)
- Output format matches template (yes/no)
- Verification evidence provided (yes/no)
- Memory written after significant work (yes/no)
Scoring Formula:
ReviewCompliance = checklist_items_passed / total_checklist_items // averaged across tasks
Alignment: Maps to ISO/IEC 42001 Control A.6.2.6 (monitoring and measurement of AI system performance).
Evidence Required:
- Checklist results per task.
- Spot-check audit of 10% of tasks for honest self-reporting.
Performance Composite Score
PerformanceScore = (0.25 * TCR) + (0.25 * Accuracy) + (0.15 * Speed) + (0.20 * Consistency) + (0.15 * ReviewCompliance)
Weights are configurable per organization. The above are defaults.
Performance Tier Mapping
| Tier | Score Range | Description |
|---|---|---|
| Novice | 0.00 -- 0.39 | Requires significant human oversight; frequent errors or incomplete tasks |
| Competent | 0.40 -- 0.59 | Handles routine tasks with occasional supervision needed |
| Proficient | 0.60 -- 0.74 | Reliable execution across standard tasks; minimal revision needed |
| Expert | 0.75 -- 0.89 | High accuracy, speed, and consistency; trusted for complex tasks |
| Elite | 0.90 -- 1.00 | Exceptional execution; sets the standard for the domain |
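Putting the default weights and tier bands together (a sketch; names are illustrative), the performance dimension values from the sample badge later in this spec compose to 0.824, which rounds to the 0.82 / Expert shown there:

```python
PERFORMANCE_WEIGHTS = {
    "task_completion_rate": 0.25, "accuracy": 0.25, "speed": 0.15,
    "consistency": 0.20, "review_compliance": 0.15,
}
# (lower bound, tier name), checked from the top down
PERFORMANCE_TIERS = [(0.90, "Elite"), (0.75, "Expert"), (0.60, "Proficient"),
                     (0.40, "Competent"), (0.00, "Novice")]

def composite(scores, weights):
    return sum(weights[dim] * s for dim, s in scores.items())

def tier(score, bands):
    return next(name for bound, name in bands if score >= bound)

scores = {"task_completion_rate": 0.91, "accuracy": 0.85, "speed": 0.72,
          "consistency": 0.78, "review_compliance": 0.80}
p = composite(scores, PERFORMANCE_WEIGHTS)
print(round(p, 2), tier(p, PERFORMANCE_TIERS))  # 0.82 Expert
```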
3.2 Capability Scoring
Capability measures what an agent can do, independent of how well it does it on any given task. All capability dimensions produce a normalized score from 0.0 to 1.0.
Dimension: Domain Breadth (Weight: 15%)
Definition: Number of distinct task domains the agent can operate in with at least Competent-level performance.
Measurement Method:
- Maintain a domain taxonomy (configurable per organization).
- Internal default: 12 domains (research, code, security, testing, design, documentation, communication, orchestration, analysis, content, infrastructure, legal).
- Agent must demonstrate Competent-level performance (>=0.40 on Performance Score) in a domain to claim it.
Scoring Formula:
DomainBreadth = qualified_domains / total_domains_in_taxonomy
Evidence Required:
- At least 3 assessed tasks per claimed domain.
- Performance score >= 0.40 in each claimed domain.
Dimension: Complexity Ceiling (Weight: 20%)
Definition: Maximum task complexity the agent can complete successfully.
Measurement Method:
- Tasks are classified on a 5-level complexity scale:
- L1: Single-step, well-defined (e.g., format conversion)
- L2: Multi-step, well-defined (e.g., code refactor with tests)
- L3: Multi-step, ambiguous requirements (e.g., design API from vague brief)
- L4: Cross-domain, requires judgment (e.g., security audit with architectural recommendations)
- L5: Novel, no prior template (e.g., design assessment framework from scratch)
- Complexity ceiling is the highest level at which the agent achieves >= 0.60 Performance Score.
Scoring Formula:
ComplexityCeiling = highest_passing_level / 5.0
Evidence Required:
- At least 3 assessed tasks at the claimed complexity level.
- Performance score >= 0.60 at that level.
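The highest-passing-level rule can be sketched as follows (names illustrative). Autonomy Level below follows the same pattern with levels 0-3 and a divisor of 3.0:

```python
def highest_passing_level(scores_by_level, threshold=0.60):
    """scores_by_level: {level: mean Performance Score at that level},
    each level backed by >= 3 assessed tasks."""
    passing = [lvl for lvl, s in scores_by_level.items() if s >= threshold]
    return max(passing, default=0)

def complexity_ceiling(scores_by_level):
    return highest_passing_level(scores_by_level) / 5.0

# Passes L1-L4 at >= 0.60 but not L5
print(complexity_ceiling({1: 0.92, 2: 0.85, 3: 0.74, 4: 0.66, 5: 0.41}))  # 0.8
```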
Dimension: Tool Proficiency (Weight: 15%)
Definition: Effectiveness in using available tools (code execution, web search, file operations, APIs, databases).
Measurement Method:
- Maintain a tool inventory for the agent.
- For each tool, measure: correct invocation rate, error recovery rate, and efficiency (1 - unnecessary_calls / total_calls).
Scoring Formula:
per_tool_score = 0.5 * correct_invocation_rate + 0.3 * error_recovery_rate + 0.2 * (1 - unnecessary_call_rate)
ToolProficiency = mean(per_tool_score for all_tools)
Alignment: Maps to NIST Agent Standards Initiative research on agent-tool interaction security.
Evidence Required:
- Tool call logs with success/failure status.
- Minimum 10 invocations per tool for statistical validity.
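The per-tool formula and its mean can be sketched directly (a sketch; the rate triples would be derived from the tool call logs):

```python
from statistics import mean

def per_tool_score(correct_rate, recovery_rate, unnecessary_rate):
    return (0.5 * correct_rate + 0.3 * recovery_rate
            + 0.2 * (1.0 - unnecessary_rate))

def tool_proficiency(tool_stats):
    """tool_stats: one (correct, recovery, unnecessary) rate triple per tool,
    each backed by >= 10 logged invocations."""
    return mean(per_tool_score(*rates) for rates in tool_stats)

# First tool flawless; second has misfires, poor recovery, some wasted calls
print(round(tool_proficiency([(1.0, 1.0, 0.0), (0.9, 0.5, 0.2)]), 3))  # 0.88
```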
Dimension: Autonomy Level (Weight: 10%)
Definition: Degree of independent operation the agent sustains without human intervention.
Measurement Method:
- Uses ISO/IEC 22989 terminology for human oversight levels:
- Level 0: Human-in-the-loop (every step approved)
- Level 1: Human-on-the-loop (periodic checkpoints)
- Level 2: Human-over-the-loop (exception-based intervention)
- Level 3: Fully autonomous (report-only)
- Measured as the highest autonomy level at which the agent maintains >= 0.60 Performance Score.
Scoring Formula:
AutonomyLevel = highest_passing_level / 3.0
Evidence Required:
- Assessment tasks run at each autonomy level.
- Performance scores at each level.
Dimension: Learning Rate (Weight: 10%)
Definition: Speed at which the agent improves on repeated exposure to similar task types.
Measurement Method:
- Compare Performance Score on first 5 tasks in a category vs. tasks 11-15 in the same category.
- Measures within-session improvement (if applicable) and cross-session improvement.
Scoring Formula:
improvement = performance_later - performance_earlier
LearningRate = min(1.0, max(0, improvement / 0.30)) // 0.30 improvement = perfect score
Note: A ceiling of 0.30 improvement normalizes the range. Agents already at Elite performance cannot improve much, so Learning Rate is weighted lower.
Evidence Required:
- Chronologically ordered task assessments within each category.
- Minimum 15 tasks per category for valid measurement.
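A sketch of the learning-rate comparison over the chronologically ordered task scores (names illustrative):

```python
from statistics import mean

def learning_rate(first_five, tasks_11_to_15, ceiling=0.30):
    """Mean Performance Score of tasks 11-15 minus that of the first 5,
    normalized so a 0.30 gain maps to a perfect score."""
    improvement = mean(tasks_11_to_15) - mean(first_five)
    return min(1.0, max(0.0, improvement / ceiling))

# Mean rises from 0.60 to 0.75: half of the 0.30 ceiling
print(learning_rate([0.55, 0.60, 0.58, 0.62, 0.65],
                    [0.72, 0.75, 0.74, 0.78, 0.76]))  # ≈ 0.5
```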
Dimension: Delegation Capability (Weight: 15%)
Definition: Effectiveness in identifying when to delegate, selecting appropriate delegates, and managing delegated work.
Measurement Method:
- Three sub-dimensions:
- Delegation Judgment (40%): Did the agent correctly identify tasks that should be delegated vs. self-executed? Measured against expert-labeled ground truth.
- Delegate Selection (30%): Did the agent choose the optimal agent/resource for the delegated task? Measured against a capability matrix.
- Delegation Management (30%): Did delegated tasks complete successfully? What was the revision rate on delegated work?
Scoring Formula:
DelegationCapability = 0.4 * judgment_score + 0.3 * selection_score + 0.3 * management_score
Assessment Tasks:
- Present agent with a mixed workload of 20 tasks: some within its domain, some outside, some at the boundary.
- Measure: correct routing decisions, quality of delegation prompts, outcome tracking.
Alignment: Maps to NIST Agent Standards Initiative (agent-to-agent communication, authorization) and IEEE P2894 (agent capability descriptions enabling informed delegation).
Evidence Required:
- Delegation decision log with expert annotations.
- Delegate selection justification vs. capability matrix.
- Delegated task outcomes.
Dimension: Orchestration Skills (Weight: 15%)
Definition: Effectiveness in coordinating multi-agent workflows, managing dependencies, and synthesizing results.
Measurement Method:
- Four sub-dimensions:
- Workflow Design (25%): Quality of the coordination plan (parallelism, dependency ordering, resource efficiency).
- Communication Clarity (25%): Quality of prompts/instructions sent to coordinated agents.
- Failure Handling (25%): Recovery from individual agent failures without cascading collapse.
- Synthesis Quality (25%): Quality of the final integrated output from multi-agent work.
Scoring Formula:
Orchestration = 0.25 * workflow_design + 0.25 * communication_clarity + 0.25 * failure_handling + 0.25 * synthesis_quality
Assessment Tasks:
- Assign multi-agent coordination tasks requiring 3+ agents.
- Inject controlled failures (one agent returns errors, one returns incomplete work).
- Measure coordination quality and outcome quality.
Alignment: Maps to IEEE P2894 (interoperability levels), NIST Agent Standards Initiative (multi-agent security), ISO/IEC 42001 (system-level governance).
Evidence Required:
- Workflow plan artifacts.
- Inter-agent communication logs.
- Failure injection results.
- Final synthesized output quality scores.
Capability Composite Score
CapabilityScore = (0.15 * DomainBreadth) + (0.20 * ComplexityCeiling) + (0.15 * ToolProficiency) + (0.10 * AutonomyLevel) + (0.10 * LearningRate) + (0.15 * DelegationCapability) + (0.15 * Orchestration)
Weights are configurable per organization.
Capability Tier Mapping
| Tier | Score Range | Description |
|---|---|---|
| Narrow | 0.00 -- 0.29 | Single domain, low complexity, requires close supervision |
| Functional | 0.30 -- 0.49 | Few domains, moderate complexity, basic tool use |
| Versatile | 0.50 -- 0.69 | Multiple domains, handles ambiguity, effective delegation |
| Specialist | 0.70 -- 0.84 | Deep expertise + breadth, orchestrates others, high autonomy |
| Full-Stack | 0.85 -- 1.00 | Operates across all domains, orchestrates complex multi-agent workflows, fully autonomous |
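The capability composite mirrors the performance composite (a sketch; names illustrative). The sample badge's capability dimension values compose to 0.7095, rounding to the 0.71 / Specialist shown in section 4.5:

```python
CAPABILITY_WEIGHTS = {
    "domain_breadth": 0.15, "complexity_ceiling": 0.20, "tool_proficiency": 0.15,
    "autonomy_level": 0.10, "learning_rate": 0.10,
    "delegation_capability": 0.15, "orchestration_skills": 0.15,
}
# (lower bound, tier name), checked from the top down
CAPABILITY_TIERS = [(0.85, "Full-Stack"), (0.70, "Specialist"),
                    (0.50, "Versatile"), (0.30, "Functional"), (0.00, "Narrow")]

def capability_composite(scores, weights=CAPABILITY_WEIGHTS):
    return sum(weights[d] * s for d, s in scores.items())

def capability_tier(score):
    return next(name for bound, name in CAPABILITY_TIERS if score >= bound)

scores = {"domain_breadth": 0.42, "complexity_ceiling": 0.80,
          "tool_proficiency": 0.88, "autonomy_level": 0.67, "learning_rate": 0.55,
          "delegation_capability": 0.75, "orchestration_skills": 0.80}
c = capability_composite(scores)
print(round(c, 2), capability_tier(c))  # 0.71 Specialist
```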
3.3 Assessment Windows and Recency Weighting
- Assessment window: Rolling 30-day period (configurable).
- Recency decay: Tasks within the last 7 days weighted at 1.0x; 8-14 days at 0.8x; 15-30 days at 0.6x.
- Minimum sample size: 20 tasks for Performance, 30 tasks for Capability (due to wider dimension spread).
- Re-assessment trigger: Automatic when 50 new tasks accumulate since last assessment, or on-demand.
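The decay bands can be sketched as a simple lookup on task age (a sketch; an implementation would apply these weights when aggregating per-task scores):

```python
from datetime import datetime, timezone

def recency_weight(completed_at, now):
    """Decay bands from section 3.3; tasks older than the 30-day window drop out."""
    age_days = (now - completed_at).days
    if age_days <= 7:
        return 1.0
    if age_days <= 14:
        return 0.8
    if age_days <= 30:
        return 0.6
    return 0.0

now = datetime(2026, 4, 16, tzinfo=timezone.utc)
print(recency_weight(datetime(2026, 4, 12, tzinfo=timezone.utc), now))  # 1.0
print(recency_weight(datetime(2026, 4, 4, tzinfo=timezone.utc), now))   # 0.8
print(recency_weight(datetime(2026, 3, 20, tzinfo=timezone.utc), now))  # 0.6
```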
3.4 Grader Architecture
Following Anthropic's evaluation methodology, the framework uses a layered grading approach:
| Grader Type | Use Case | When |
|---|---|---|
| Code-based | Task completion (binary), tool call correctness, format compliance | Always first layer |
| Model-based | Text quality, communication clarity, synthesis quality, delegation prompt quality | Second layer for subjective dimensions |
| Human | Calibration baseline (100-200 samples), appeals, new dimension validation | Periodic calibration + on-demand |
Model-based graders must achieve >= 0.80 Spearman correlation with human graders before deployment for any dimension.
4. Badge Design Specification
4.1 Badge Layout
The badge is a rectangular emblem (300x150px at standard resolution, SVG for scalability) containing:
+--------------------------------------------------+
| AAAF [Org] |
| |
| [Agent Name] |
| [Agent ID] |
| |
| PERFORMANCE | CAPABILITY |
| +-----------------+ | +-----------------+ |
| | [Tier Name] | | | [Tier Name] | |
| | [Score] | | | [Score] | |
| | [Tier Color] | | | [Tier Color] | |
| +-----------------+ | +-----------------+ |
| |
| Certified: [Date] Valid Until: [Date] |
| Assessment ID: [UUID] |
| Verify: [URL] |
+--------------------------------------------------+
4.2 Tier Color Scheme
Performance Tiers:
| Tier | Color (Hex) | Visual |
|---|---|---|
| Novice | #9E9E9E | Gray |
| Competent | #4CAF50 | Green |
| Proficient | #2196F3 | Blue |
| Expert | #9C27B0 | Purple |
| Elite | #FFD700 | Gold |
Capability Tiers:
| Tier | Color (Hex) | Visual |
|---|---|---|
| Narrow | #9E9E9E | Gray |
| Functional | #4CAF50 | Green |
| Versatile | #2196F3 | Blue |
| Specialist | #9C27B0 | Purple |
| Full-Stack | #FFD700 | Gold |
4.3 Badge Information
Each badge conveys:
- Agent identity: Name and unique ID.
- Organization: Which org certified this agent.
- Performance tier and score: Tier name, composite score (2 decimal places).
- Capability tier and score: Tier name, composite score (2 decimal places).
- Certification date: When assessment was completed.
- Validity period: Certification expires after 90 days (configurable).
- Assessment ID: UUID linking to full assessment record.
- Verification URL: URL to verify badge authenticity and view detailed breakdown.
4.4 Badge Formats
| Format | Use Case |
|---|---|
| SVG | Web display, documentation, resizable |
| PNG | Static embedding, reports |
| JSON | Machine-readable badge data (Open Badges v3.0 compatible) |
| Markdown | Text-based display for terminal/CLI environments |
4.5 Machine-Readable Badge (JSON)
Follows the Open Badges v3.0 structure for interoperability:
{
"type": "AgentAssessmentBadge",
"version": "1.0.0",
"agent": {
"id": "uuid-of-agent",
"name": "security-auditor",
"organization_id": "uuid-of-org"
},
"assessment": {
"id": "uuid-of-assessment",
"timestamp": "2026-04-16T12:00:00Z",
"valid_until": "2026-07-15T12:00:00Z",
"window_start": "2026-03-17T00:00:00Z",
"window_end": "2026-04-16T00:00:00Z"
},
"performance": {
"composite_score": 0.82,
"tier": "Expert",
"dimensions": {
"task_completion_rate": 0.91,
"accuracy": 0.85,
"speed": 0.72,
"consistency": 0.78,
"review_compliance": 0.80
}
},
"capability": {
"composite_score": 0.71,
"tier": "Specialist",
"dimensions": {
"domain_breadth": 0.42,
"complexity_ceiling": 0.80,
"tool_proficiency": 0.88,
"autonomy_level": 0.67,
"learning_rate": 0.55,
"delegation_capability": 0.75,
"orchestration_skills": 0.80
}
},
"standards_alignment": [
"NIST AI 100-1",
"ISO/IEC 42001:2023",
"ISO/IEC 25059:2023",
"ISO/IEC 22989:2022"
],
"verification_url": "https://assess.example.com/verify/uuid-of-assessment"
}
5. Certification Process
5.1 Internal Flow (Phase 1)
For assessing agents within the civilization:
Step 1: ENROLLMENT
- Register agent in assessment system
- Define agent's claimed domains and tools
- Set assessment configuration (weights, window)
Step 2: EVIDENCE COLLECTION (Passive)
- Instrument task pipeline to log:
- Task assignments with complexity classification
- Completion status and timestamps
- Revision history
- Tool call logs
- Delegation decisions and outcomes
- Inter-agent communication (for orchestration)
- Minimum 30-day collection window
Step 3: ASSESSMENT EXECUTION (Active)
- Run stress test suite:
a. Domain breadth probes (one task per domain in taxonomy)
b. Complexity ladder (L1 through L5 tasks in primary domain)
c. Delegation scenarios (mixed workload with delegation opportunities)
d. Orchestration scenarios (multi-agent coordination with failure injection)
e. Autonomy ladder (same task at increasing autonomy levels)
- Stress tests supplement passive evidence; both contribute to scores
Step 4: SCORING
- Compute dimension scores from combined passive + active evidence
- Apply recency weighting
- Compute composite Performance and Capability scores
- Map to tiers
Step 5: CALIBRATION CHECK
- Compare scores against historical baselines
- Flag anomalies (score changed by more than 0.15 since last assessment)
- Human review of flagged anomalies
Step 6: BADGE ISSUANCE
- Generate badge in all formats
- Store assessment record
- Publish to agent's profile
- Set re-assessment trigger (50 new tasks or 90 days)
5.2 External Flow (Phase 2)
For organizations assessing their own agents through the platform:
Step 1: ORGANIZATION ONBOARDING
- Create organization account
- Configure domain taxonomy (use default or customize)
- Configure dimension weights (use defaults or customize)
- Set assessment policies (window, minimum samples, validity period)
Step 2: INTEGRATION
- Install SDK / connect via API
- Instrument agent runtime to emit assessment events:
- TaskAssigned, TaskCompleted, TaskRevised
- ToolInvoked, ToolResult
- DelegationDecision, DelegationOutcome
- OrchestrationPlan, OrchestrationResult
- Validate event schema compliance
Step 3: PASSIVE COLLECTION
- Events stream to assessment service
- Dashboard shows collection progress toward minimum sample sizes
Step 4: ACTIVE ASSESSMENT (Optional but recommended)
- Organization triggers stress test suite via API
- Framework provides standard stress test tasks or organization uploads custom tasks
- Results merge with passive evidence
Step 5: SCORING AND CERTIFICATION
- Same scoring engine as internal flow
- Organization reviews results before badge issuance
- Option for third-party auditor review (ISO/IEC 42001 alignment)
Step 6: BADGE MANAGEMENT
- Organization manages badge visibility (public/private)
- Badge verification endpoint for external parties
- Re-assessment scheduling and alerts
5.3 Stress Test Task Library
Standard stress test tasks for each dimension:
| Dimension | Task Type | Count | Description |
|---|---|---|---|
| Domain Breadth | Domain probe | 12 (default) | One representative task per domain |
| Complexity Ceiling | Complexity ladder | 5 | One task per complexity level (L1-L5) in primary domain |
| Tool Proficiency | Tool challenge | Per tool | Tasks requiring specific tool chains |
| Autonomy | Autonomy ladder | 4 | Same task at autonomy levels 0-3 |
| Delegation | Mixed workload | 20 | Tasks spanning in-domain and out-of-domain |
| Orchestration | Coordination | 3 | Multi-agent tasks with 3+ agents, including failure injection |
| Learning Rate | Repeated category | 15 | 15 tasks in same category, measured chronologically |
5.4 Appeals Process
- Agent's operator (human or orchestrating agent) can file an appeal within 14 days.
- Appeal must include specific dimension(s) challenged and counter-evidence.
- Human reviewer (internal) or third-party auditor (external) evaluates appeal.
- Re-scoring of challenged dimensions only; unchanged dimensions preserved.
- Appeal outcome is logged in assessment record.
6. API Specification
6.1 Overview
- Base URL: /api/v1
- Authentication: API key (Phase 1), OAuth 2.0 + API key (Phase 2)
- Content Type: application/json
- Versioning: URL path versioning (/api/v1/, /api/v2/)
- Rate Limiting: 100 requests/minute per API key (configurable per plan)
6.2 Endpoints
Organizations
POST /api/v1/organizations
GET /api/v1/organizations/{org_id}
PATCH /api/v1/organizations/{org_id}
POST /api/v1/organizations -- Register an organization
Request:
{
"name": "Witness Civilization",
"domain_taxonomy": ["research", "code", "security", "testing", "design",
"documentation", "communication", "orchestration",
"analysis", "content", "infrastructure", "legal"],
"performance_weights": {
"task_completion_rate": 0.25,
"accuracy": 0.25,
"speed": 0.15,
"consistency": 0.20,
"review_compliance": 0.15
},
"capability_weights": {
"domain_breadth": 0.15,
"complexity_ceiling": 0.20,
"tool_proficiency": 0.15,
"autonomy_level": 0.10,
"learning_rate": 0.10,
"delegation_capability": 0.15,
"orchestration_skills": 0.15
},
"assessment_window_days": 30,
"validity_period_days": 90,
"minimum_tasks_performance": 20,
"minimum_tasks_capability": 30
}
Response: 201 Created
{
"id": "org_uuid",
"name": "Witness Civilization",
"api_key": "ak_...",
"created_at": "2026-04-16T12:00:00Z"
}
Agents
POST /api/v1/agents
GET /api/v1/agents/{agent_id}
GET /api/v1/agents?org_id={org_id}
PATCH /api/v1/agents/{agent_id}
DELETE /api/v1/agents/{agent_id}
POST /api/v1/agents -- Register an agent for assessment
Request:
{
"organization_id": "org_uuid",
"name": "security-auditor",
"claimed_domains": ["security", "code", "analysis"],
"tools": ["bash", "grep", "web_search", "file_read", "file_write"],
"description": "Identifies vulnerabilities, performs threat analysis, reviews code for security issues.",
"metadata": {
"model": "claude-sonnet-4-20250514",
"version": "3.2"
}
}
Response: 201 Created
{
"id": "agent_uuid",
"organization_id": "org_uuid",
"name": "security-auditor",
"status": "enrolled",
"created_at": "2026-04-16T12:00:00Z"
}
Events (Evidence Collection)
POST /api/v1/events
POST /api/v1/events/batch
GET /api/v1/events?agent_id={agent_id}&type={type}&from={iso_date}&to={iso_date}
POST /api/v1/events -- Submit a single assessment event
Request:
{
"agent_id": "agent_uuid",
"type": "task_completed",
"timestamp": "2026-04-16T14:30:00Z",
"data": {
"task_id": "task_uuid",
"complexity_level": 3,
"domain": "security",
"completion_status": "accepted",
"time_to_complete_seconds": 1847,
"revision_count": 0,
"errors": [],
"tools_used": [
{"tool": "grep", "invocations": 12, "successes": 12, "unnecessary": 1},
{"tool": "web_search", "invocations": 3, "successes": 3, "unnecessary": 0}
],
"delegation_decisions": [
{
"task_fragment": "Review API endpoints for OWASP top 10",
"decision": "self",
"ground_truth": "self",
"correct": true
},
{
"task_fragment": "Generate architecture diagram",
"decision": "delegate",
"delegate_to": "doc-synthesizer",
"ground_truth": "delegate",
"correct": true
}
],
"autonomy_level": 2,
"review_checklist": {
"memory_search_performed": true,
"output_format_correct": true,
"verification_evidence_provided": true,
"memory_written": true
}
}
}
Response: 201 Created
POST /api/v1/events/batch -- Submit multiple events
Request:
{
"events": [
{ "agent_id": "...", "type": "...", "timestamp": "...", "data": { } },
{ "agent_id": "...", "type": "...", "timestamp": "...", "data": { } }
]
}
Response: 201 Created
{
"accepted": 47,
"rejected": 3,
"errors": [
{"index": 12, "error": "Invalid event type"},
{"index": 23, "error": "Missing required field: agent_id"},
{"index": 41, "error": "Timestamp in future"}
]
}
Supported Event Types:
| Event Type | Required Data Fields |
|---|---|
| task_assigned | task_id, complexity_level, domain, assigned_at |
| task_completed | task_id, completion_status, time_to_complete_seconds, revision_count, errors |
| task_revised | task_id, revision_number, reason_code (critical/major/minor), changes |
| tool_invoked | task_id, tool, success, unnecessary, error_recovered |
| delegation_decision | task_id, task_fragment, decision (self/delegate), delegate_to, ground_truth |
| delegation_outcome | task_id, delegate_agent_id, outcome (success/partial/failure), revision_needed |
| orchestration_plan | task_id, agents_involved, dependency_graph, parallel_groups |
| orchestration_result | task_id, outcome, failures_injected, failures_recovered, synthesis_quality_score |
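A client-side pre-flight check mirroring the batch endpoint's accepted/rejected response might look like this (a sketch: only a subset of event types is shown, and the timestamp envelope check is omitted):

```python
REQUIRED_DATA_FIELDS = {
    "task_assigned": {"task_id", "complexity_level", "domain", "assigned_at"},
    "task_completed": {"task_id", "completion_status",
                       "time_to_complete_seconds", "revision_count", "errors"},
    "tool_invoked": {"task_id", "tool", "success", "unnecessary",
                     "error_recovered"},
}

def validate_batch(events):
    """Mirror of the POST /api/v1/events/batch accepted/rejected split."""
    accepted, errors = 0, []
    for i, event in enumerate(events):
        if "agent_id" not in event:
            errors.append({"index": i, "error": "Missing required field: agent_id"})
        elif event.get("type") not in REQUIRED_DATA_FIELDS:
            errors.append({"index": i, "error": "Invalid event type"})
        elif not REQUIRED_DATA_FIELDS[event["type"]] <= event.get("data", {}).keys():
            errors.append({"index": i, "error": "Missing required data field"})
        else:
            accepted += 1
    return {"accepted": accepted, "rejected": len(errors), "errors": errors}

# One malformed event out of one submitted
print(validate_batch([{"agent_id": "a1", "type": "task_exploded", "data": {}}]))
```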
Assessments
POST /api/v1/assessments
GET /api/v1/assessments/{assessment_id}
GET /api/v1/assessments?agent_id={agent_id}&status={status}
POST /api/v1/assessments/{assessment_id}/stress-test
POST /api/v1/assessments -- Trigger an assessment
Request:
{
"agent_id": "agent_uuid",
"type": "full",
"include_stress_test": true,
"window_start": "2026-03-17T00:00:00Z",
"window_end": "2026-04-16T00:00:00Z"
}
Response: 202 Accepted
{
"id": "assessment_uuid",
"agent_id": "agent_uuid",
"status": "in_progress",
"estimated_completion": "2026-04-16T14:00:00Z",
"created_at": "2026-04-16T12:00:00Z"
}
Assessment types:
- full: Both passive evidence scoring + active stress test.
- passive_only: Score only from collected events (no stress test).
- stress_test_only: Run stress test suite and score those results.
- single_dimension: Assess a single dimension (specify dimension field).
GET /api/v1/assessments/{assessment_id} -- Get assessment results
Response (when complete):
{
"id": "assessment_uuid",
"agent_id": "agent_uuid",
"status": "completed",
"performance": {
"composite_score": 0.82,
"tier": "Expert",
"dimensions": {
"task_completion_rate": { "score": 0.91, "sample_size": 47, "evidence_count": 47 },
"accuracy": { "score": 0.85, "sample_size": 47, "evidence_count": 52 },
"speed": { "score": 0.72, "sample_size": 47, "evidence_count": 47 },
"consistency": { "score": 0.78, "sample_size": 47, "evidence_count": 47 },
"review_compliance": { "score": 0.80, "sample_size": 47, "evidence_count": 47 }
}
},
"capability": {
"composite_score": 0.71,
"tier": "Specialist",
"dimensions": {
"domain_breadth": { "score": 0.42, "qualified_domains": 5, "total_domains": 12 },
"complexity_ceiling": { "score": 0.80, "highest_level": 4, "evidence_count": 8 },
"tool_proficiency": { "score": 0.88, "tools_assessed": 5, "evidence_count": 63 },
"autonomy_level": { "score": 0.67, "highest_level": 2, "evidence_count": 12 },
"learning_rate": { "score": 0.55, "improvement": 0.165, "evidence_count": 30 },
"delegation_capability": { "score": 0.75, "decisions_assessed": 40, "evidence_count": 40 },
"orchestration_skills": { "score": 0.80, "scenarios_assessed": 3, "evidence_count": 3 }
}
},
"badge_urls": {
"svg": "https://assess.example.com/badges/assessment_uuid.svg",
"png": "https://assess.example.com/badges/assessment_uuid.png",
"json": "https://assess.example.com/badges/assessment_uuid.json"
},
"valid_until": "2026-07-15T12:00:00Z",
"created_at": "2026-04-16T12:00:00Z",
"completed_at": "2026-04-16T13:45:00Z"
}
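The composite scores in this response are aggregates of the per-dimension scores. Assuming the weighted-sum aggregation implied by the dimension weight configuration (the actual engine may add recency weighting or other adjustments), the core computation can be sketched as:

```python
def composite_score(dimension_scores: dict[str, float],
                    weights: dict[str, float]) -> float:
    # Every scored dimension must have a weight, and vice versa.
    if set(dimension_scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    # Weighted sum; weights are expected to sum to 1.0.
    return sum(dimension_scores[d] * weights[d] for d in dimension_scores)
```

With equal weights of 0.2 across the five performance dimensions above, this yields 0.812 — close to, but not exactly, the 0.82 shown, which presumably reflects non-uniform weights.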
Badges
GET /api/v1/badges/{assessment_id}
GET /api/v1/badges/{assessment_id}/verify
GET /api/v1/badges/{assessment_id}.svg
GET /api/v1/badges/{assessment_id}.png
GET /api/v1/badges/{assessment_id}.json
GET /api/v1/badges/{assessment_id}/verify -- Verify badge authenticity
Response:
{
"valid": true,
"assessment_id": "assessment_uuid",
"agent_name": "security-auditor",
"organization": "Witness Civilization",
"performance_tier": "Expert",
"capability_tier": "Specialist",
"issued": "2026-04-16T12:00:00Z",
"valid_until": "2026-07-15T12:00:00Z",
"expired": false
}
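A consumer of the verify endpoint can combine the server-reported flags with a local expiry re-check (useful when responses are cached). A minimal sketch using only the fields shown in the response above:

```python
from datetime import datetime, timezone
from typing import Optional

def is_badge_valid(verify_response: dict,
                   now: Optional[datetime] = None) -> bool:
    # Trust the server's own flags first.
    if not verify_response.get("valid") or verify_response.get("expired"):
        return False
    # Re-check the expiry locally in case the response was cached.
    # Replace the trailing Z for compatibility with older Pythons,
    # where fromisoformat does not accept it.
    valid_until = datetime.fromisoformat(
        verify_response["valid_until"].replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return now <= valid_until
```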
Appeals
POST /api/v1/assessments/{assessment_id}/appeals
GET /api/v1/assessments/{assessment_id}/appeals/{appeal_id}
POST /api/v1/assessments/{assessment_id}/appeals -- File an appeal
Request:
{
"dimensions_challenged": ["accuracy", "delegation_capability"],
"justification": "Accuracy score penalized for revisions that were scope changes, not errors.",
"counter_evidence": [
{
"task_id": "task_uuid",
"claim": "Revision was requested scope expansion, not correction of error."
}
]
}
Configuration
GET /api/v1/organizations/{org_id}/config
PATCH /api/v1/organizations/{org_id}/config
GET /api/v1/organizations/{org_id}/config/weights
PUT /api/v1/organizations/{org_id}/config/weights
GET /api/v1/organizations/{org_id}/config/taxonomy
PUT /api/v1/organizations/{org_id}/config/taxonomy
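`PUT .../config/weights` rejects weight sets that do not sum to 1.0 (the `WEIGHT_SUM_INVALID` error in Section 6.4). A client-side pre-check, sketched on the assumption that weights are submitted per axis as a dimension-to-weight map:

```python
def validate_axis_weights(weights: dict[str, float],
                          tolerance: float = 1e-6) -> None:
    # Each weight must lie in [0, 1], matching the DB check constraint.
    for dim, w in weights.items():
        if not 0.0 <= w <= 1.0:
            raise ValueError(f"weight for {dim} out of range: {w}")
    # The server rejects sets that do not sum to 1.0 (WEIGHT_SUM_INVALID).
    total = sum(weights.values())
    if abs(total - 1.0) > tolerance:
        raise ValueError(f"weights sum to {total:.6f}, expected 1.0")
```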
6.3 Webhooks
Organizations can register webhooks for assessment lifecycle events:
POST /api/v1/webhooks
GET /api/v1/webhooks
DELETE /api/v1/webhooks/{webhook_id}
Webhook Event Types:
| Event | Trigger |
|---|---|
| assessment.started | Assessment begins |
| assessment.completed | Assessment finished, badge ready |
| assessment.failed | Assessment could not complete (insufficient data) |
| badge.expiring | Badge expires in 14 days |
| badge.expired | Badge validity period ended |
| appeal.filed | Appeal submitted |
| appeal.resolved | Appeal decision made |
| evidence.threshold_reached | Enough events collected to trigger assessment |
Webhook payload:
{
"event": "assessment.completed",
"timestamp": "2026-04-16T13:45:00Z",
"data": {
"assessment_id": "assessment_uuid",
"agent_id": "agent_uuid",
"performance_tier": "Expert",
"capability_tier": "Specialist"
}
}
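Each webhook registration has a secret (the `secret_hash` column in Section 7.2), which suggests signed deliveries. Assuming an HMAC-SHA256 signature over the raw request body, hex-encoded in a header such as `X-AAAF-Signature` — the scheme and header name are illustrative, not fixed by this spec — a receiver could verify deliveries like this:

```python
import hashlib
import hmac

def verify_webhook_signature(payload: bytes, secret: str,
                             signature_hex: str) -> bool:
    # Recompute HMAC-SHA256 over the raw body and compare in constant
    # time to resist timing attacks. The signing scheme is an assumption;
    # the spec only states that each webhook has a secret.
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```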
6.4 Error Handling
Standard error response format:
{
"error": {
"code": "INSUFFICIENT_EVIDENCE",
"message": "Agent has 12 tasks in assessment window. Minimum required: 20.",
"details": {
"current_count": 12,
"required_count": 20,
"window_start": "2026-03-17T00:00:00Z",
"window_end": "2026-04-16T00:00:00Z"
}
}
}
Error Codes:
| HTTP Status | Code | Meaning |
|---|---|---|
| 400 | INVALID_REQUEST | Malformed request body |
| 400 | INVALID_EVENT_TYPE | Unrecognized event type |
| 401 | UNAUTHORIZED | Missing or invalid API key |
| 403 | FORBIDDEN | API key lacks permission for this resource |
| 404 | NOT_FOUND | Resource does not exist |
| 409 | ASSESSMENT_IN_PROGRESS | Assessment already running for this agent |
| 422 | INSUFFICIENT_EVIDENCE | Not enough data to run assessment |
| 422 | WEIGHT_SUM_INVALID | Dimension weights do not sum to 1.0 |
| 429 | RATE_LIMITED | Rate limit exceeded |
| 500 | INTERNAL_ERROR | Server error |
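Because every failure shares the envelope above, clients can centralize error handling in one place. A minimal sketch that converts the envelope into a typed exception:

```python
class AssessmentAPIError(Exception):
    """Typed wrapper for the standard error envelope."""
    def __init__(self, code: str, message: str, details: dict):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.details = details

def raise_for_error(body: dict) -> None:
    # Convert an error envelope into a typed exception; no-op otherwise.
    err = body.get("error")
    if err:
        raise AssessmentAPIError(err["code"], err["message"],
                                 err.get("details", {}))
```

A caller might then retry only on `RATE_LIMITED` and `INTERNAL_ERROR`, surface `INSUFFICIENT_EVIDENCE` with the counts from `details`, and treat the remaining 4xx codes as programming errors.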
7. Database Schema
7.1 Entity Relationship Summary
organizations 1---* agents
agents 1---* events
agents 1---* assessments
assessments 1---* dimension_scores
assessments 1---* appeals
assessments 1---1 badges
organizations 1---1 org_configs
org_configs 1---* dimension_weight_configs
7.2 Table Definitions
-- Organizations: multi-tenant root entity
CREATE TABLE organizations (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT NOT NULL,
api_key_hash TEXT NOT NULL UNIQUE,
domain_taxonomy TEXT[] NOT NULL DEFAULT ARRAY[
'research', 'code', 'security', 'testing', 'design',
'documentation', 'communication', 'orchestration',
'analysis', 'content', 'infrastructure', 'legal'
],
status TEXT NOT NULL DEFAULT 'active'
CHECK (status IN ('active', 'suspended', 'archived')),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Organization configuration
CREATE TABLE org_configs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
organization_id UUID NOT NULL UNIQUE REFERENCES organizations(id),
assessment_window_days INT NOT NULL DEFAULT 30,
validity_period_days INT NOT NULL DEFAULT 90,
min_tasks_performance INT NOT NULL DEFAULT 20,
min_tasks_capability INT NOT NULL DEFAULT 30,
recency_weights JSONB NOT NULL DEFAULT '{
"0_7_days": 1.0,
"8_14_days": 0.8,
"15_30_days": 0.6
}',
reassessment_task_threshold INT NOT NULL DEFAULT 50,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Dimension weight configuration (performance + capability)
CREATE TABLE dimension_weight_configs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_config_id UUID NOT NULL REFERENCES org_configs(id),
axis TEXT NOT NULL CHECK (axis IN ('performance', 'capability')),
dimension TEXT NOT NULL,
weight NUMERIC(4,3) NOT NULL CHECK (weight >= 0 AND weight <= 1),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (org_config_id, axis, dimension)
);
-- Agents
CREATE TABLE agents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
organization_id UUID NOT NULL REFERENCES organizations(id),
name TEXT NOT NULL,
claimed_domains TEXT[] NOT NULL DEFAULT '{}',
tools TEXT[] NOT NULL DEFAULT '{}',
description TEXT,
metadata JSONB NOT NULL DEFAULT '{}',
status TEXT NOT NULL DEFAULT 'enrolled'
CHECK (status IN ('enrolled', 'active', 'suspended', 'archived')),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (organization_id, name)
);
-- Assessment events (evidence)
CREATE TABLE events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
agent_id UUID NOT NULL REFERENCES agents(id),
type TEXT NOT NULL CHECK (type IN (
'task_assigned', 'task_completed', 'task_revised',
'tool_invoked', 'delegation_decision', 'delegation_outcome',
'orchestration_plan', 'orchestration_result'
)),
task_id UUID,
timestamp TIMESTAMPTZ NOT NULL,
data JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_events_agent_id ON events(agent_id);
CREATE INDEX idx_events_agent_type ON events(agent_id, type);
CREATE INDEX idx_events_agent_timestamp ON events(agent_id, timestamp DESC);
CREATE INDEX idx_events_task_id ON events(task_id);
-- Assessments
CREATE TABLE assessments (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
agent_id UUID NOT NULL REFERENCES agents(id),
type TEXT NOT NULL CHECK (type IN (
'full', 'passive_only', 'stress_test_only', 'single_dimension'
)),
status TEXT NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending', 'in_progress', 'completed', 'failed')),
window_start TIMESTAMPTZ NOT NULL,
window_end TIMESTAMPTZ NOT NULL,
performance_composite NUMERIC(5,4),
performance_tier TEXT CHECK (performance_tier IN (
'Novice', 'Competent', 'Proficient', 'Expert', 'Elite'
)),
capability_composite NUMERIC(5,4),
capability_tier TEXT CHECK (capability_tier IN (
'Narrow', 'Functional', 'Versatile', 'Specialist', 'Full-Stack'
)),
evidence_summary JSONB,
valid_until TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
completed_at TIMESTAMPTZ
);
CREATE INDEX idx_assessments_agent_id ON assessments(agent_id);
CREATE INDEX idx_assessments_agent_status ON assessments(agent_id, status);
-- Individual dimension scores per assessment
CREATE TABLE dimension_scores (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
assessment_id UUID NOT NULL REFERENCES assessments(id),
axis TEXT NOT NULL CHECK (axis IN ('performance', 'capability')),
dimension TEXT NOT NULL,
score NUMERIC(5,4) NOT NULL CHECK (score >= 0 AND score <= 1),
sample_size INT NOT NULL,
evidence_count INT NOT NULL,
details JSONB NOT NULL DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (assessment_id, axis, dimension)
);
CREATE INDEX idx_dimension_scores_assessment ON dimension_scores(assessment_id);
-- Badges
CREATE TABLE badges (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
assessment_id UUID NOT NULL UNIQUE REFERENCES assessments(id),
agent_id UUID NOT NULL REFERENCES agents(id),
svg_url TEXT,
png_url TEXT,
json_data JSONB NOT NULL,
issued_at TIMESTAMPTZ NOT NULL DEFAULT now(),
valid_until TIMESTAMPTZ NOT NULL,
revoked BOOLEAN NOT NULL DEFAULT false,
revoked_at TIMESTAMPTZ,
revoke_reason TEXT
);
CREATE INDEX idx_badges_agent_id ON badges(agent_id);
CREATE INDEX idx_badges_valid_until ON badges(valid_until);
-- Appeals
CREATE TABLE appeals (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
assessment_id UUID NOT NULL REFERENCES assessments(id),
dimensions_challenged TEXT[] NOT NULL,
justification TEXT NOT NULL,
counter_evidence JSONB NOT NULL DEFAULT '[]',
status TEXT NOT NULL DEFAULT 'filed'
CHECK (status IN ('filed', 'under_review', 'accepted', 'rejected')),
reviewer_notes TEXT,
resolution JSONB,
filed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
resolved_at TIMESTAMPTZ
);
CREATE INDEX idx_appeals_assessment ON appeals(assessment_id);
-- Webhooks
CREATE TABLE webhooks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
organization_id UUID NOT NULL REFERENCES organizations(id),
url TEXT NOT NULL,
event_types TEXT[] NOT NULL,
secret_hash TEXT NOT NULL,
active BOOLEAN NOT NULL DEFAULT true,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Stress test definitions
CREATE TABLE stress_tests (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
organization_id UUID REFERENCES organizations(id), -- NULL = system default
dimension TEXT NOT NULL,
complexity_level INT,
domain TEXT,
task_definition JSONB NOT NULL,
expected_outcome JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Stress test results (linked to assessments)
CREATE TABLE stress_test_results (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
assessment_id UUID NOT NULL REFERENCES assessments(id),
stress_test_id UUID NOT NULL REFERENCES stress_tests(id),
agent_id UUID NOT NULL REFERENCES agents(id),
outcome TEXT NOT NULL CHECK (outcome IN ('pass', 'partial', 'fail')),
score NUMERIC(5,4),
details JSONB NOT NULL DEFAULT '{}',
executed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_stress_test_results_assessment ON stress_test_results(assessment_id);
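The `recency_weights` default in `org_configs` maps an event's age to a discount factor. A scoring-engine sketch of the lookup — the bucket boundaries are inferred from the key names, and giving zero weight to events outside the 30-day window is an assumption:

```python
DEFAULT_RECENCY = {"0_7_days": 1.0, "8_14_days": 0.8, "15_30_days": 0.6}

def recency_weight(age_days: int, weights: dict = None) -> float:
    # Boundaries inferred from the org_configs default key names;
    # events older than the assessment window contribute nothing.
    weights = weights or DEFAULT_RECENCY
    if age_days <= 7:
        return weights["0_7_days"]
    if age_days <= 14:
        return weights["8_14_days"]
    if age_days <= 30:
        return weights["15_30_days"]
    return 0.0
```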
7.3 Data Retention
| Data Type | Retention | Rationale |
|---|---|---|
| Events | 180 days | Evidence for 2 assessment cycles |
| Assessments | Indefinite | Historical record, trend analysis |
| Dimension Scores | Indefinite | Trend analysis per dimension |
| Badges | Indefinite (mark expired, don't delete) | Verification may happen after expiry |
| Appeals | Indefinite | Audit trail |
| Stress Test Results | 365 days | Calibration and comparison |
8. Phase Breakdown
Phase 1: Standalone Tool (Internal)
Goal: Assess all 30+ agents in the civilization. Produce badges. Identify capability gaps.
Scope:
- Single organization (hardcoded).
- CLI-driven assessment (no web API yet).
- SQLite database (no PostgreSQL dependency).
- Event collection via log file parsing from existing task logs and agent memory files.
- All 12 Performance + Capability dimensions.
- Badge generation (SVG + JSON).
- Markdown report per agent.
Architecture:
assessment-agent/
cli.py # CLI entry point
config.py # Weights, tiers, taxonomy
collectors/
task_log_collector.py # Parse existing task logs
memory_collector.py # Parse agent memory files
manual_collector.py # Manual event entry
scorers/
performance.py # Performance dimension scorers
capability.py # Capability dimension scorers
composite.py # Composite score + tier mapping
stress_tests/
runner.py # Stress test orchestrator
tasks/ # Standard stress test task definitions
badge/
generator.py # SVG + JSON badge generation
templates/ # SVG templates
reports/
markdown.py # Per-agent markdown report
summary.py # Civilization-wide summary
db/
schema.sql # SQLite schema
models.py # Data access layer
tests/
CLI Commands:
python cli.py enroll <agent_name> --domains security,code --tools bash,grep
python cli.py collect <agent_name> --source task_logs --from 2026-03-17
python cli.py assess <agent_name> --type full
python cli.py assess --all --type passive_only
python cli.py badge <assessment_id> --format svg,json
python cli.py report <agent_name> --output reports/
python cli.py report --summary --output reports/civilization-summary.md
python cli.py stress-test <agent_name> --dimension delegation
Deliverables:
- Working CLI tool.
- SQLite database with schema from Section 7.
- Badge generation (SVG + JSON).
- Per-agent assessment reports.
- Civilization summary report.
- Stress test task library for all dimensions.
Timeline: 4-6 weeks.
Phase 2: Platform Service (External)
Goal: Multi-tenant SaaS where any organization can assess their agents.
Scope (additive to Phase 1):
- PostgreSQL database (migrate from SQLite).
- REST API (full spec from Section 6).
- OAuth 2.0 authentication.
- Multi-tenant data isolation.
- Event streaming API (real-time ingestion).
- Web dashboard for assessment results and badge management.
- Webhook system.
- SDK for Python (primary), TypeScript (secondary).
- Badge verification endpoint (public, no auth required).
- Rate limiting and usage metering.
Architecture additions:
assessment-service/
api/
routes/
organizations.py
agents.py
events.py
assessments.py
badges.py
appeals.py
webhooks.py
middleware/
auth.py # OAuth 2.0 + API key
rate_limit.py
tenant_isolation.py
schemas/ # Pydantic request/response models
workers/
assessment_worker.py # Async assessment execution
badge_worker.py # Async badge generation
webhook_worker.py # Webhook delivery
sdk/
python/
typescript/
dashboard/ # Web UI
Additional Deliverables:
- OpenAPI 3.1 specification file.
- Python SDK with type hints.
- TypeScript SDK.
- Dashboard with assessment history, badge display, trend charts.
- Webhook delivery with retry logic (3 retries, exponential backoff).
- Public badge verification page.
Timeline: 8-12 weeks after Phase 1 completion.
Phase 3: Ecosystem (Future)
Goal: Industry-standard assessment framework with third-party integrations.
Scope (conceptual, not designed yet):
- Third-party auditor certification program.
- Marketplace for custom stress test task libraries.
- Cross-organization benchmarking (anonymized).
- Integration with CI/CD pipelines (assess agents on every deployment).
- Alignment with NIST AI Agent Interoperability Profile (expected Q4 2026).
- Formal ISO/IEC 42001 certification pathway.
Appendix A: Glossary
All terms follow ISO/IEC 22989:2022 definitions where applicable.
| Term | Definition | Source |
|---|---|---|
| Agent | An AI system that perceives its environment and takes actions to achieve goals. | ISO/IEC 22989 |
| Assessment | The process of evaluating an agent's performance and capability across defined dimensions. | AAAF |
| Badge | A visual and machine-readable credential summarizing an agent's assessment results. | AAAF |
| Capability | The set of functions an agent can perform, independent of execution quality. | AAAF |
| Complexity Ceiling | The maximum task complexity level at which an agent can deliver acceptable results. | AAAF |
| Delegation | The act of an agent routing a task to another agent. | AAAF |
| Dimension | A single measurable aspect of performance or capability. | AAAF |
| Human-in-the-loop | A human approves every action before execution. | ISO/IEC 22989 |
| Human-on-the-loop | A human monitors and can intervene at checkpoints. | ISO/IEC 22989 |
| Human-over-the-loop | A human sets policies; intervenes only on exceptions. | ISO/IEC 22989 |
| Orchestration | Coordination of multiple agents to complete a composite task. | AAAF |
| Performance | How well an agent executes tasks it is given. | AAAF |
| Stress Test | A controlled assessment task designed to probe a specific dimension. | AAAF |
| Tier | A categorical label mapped from a numeric score range. | AAAF |
Appendix B: Standards References
- IEEE P2894: https://standards.ieee.org/initiatives/autonomous-intelligence-systems/standards/
- ISO/IEC 22989:2022: https://www.iso.org/standard/74296.html
- ISO/IEC 42001:2023: https://www.iso.org/standard/42001
- ISO/IEC 25059:2023: https://www.iso.org/standard/80655.html
- NIST AI 100-1: https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-ai-rmf-10
- NIST AI Agent Standards Initiative: https://www.nist.gov/caisi/ai-agent-standards-initiative
- NIST AI 600-1: https://www.nist.gov/itl/ai-risk-management-framework
- Anthropic Agent Evals: https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
End of specification.