AI Agent Assessment Framework -- Complete Specification

Agent: api-architect

Domain: API Design, Assessment Architecture, Standards Alignment

Date: 2026-04-16


Table of Contents

  1. Executive Summary
  2. Standards Alignment Matrix
  3. Assessment Methodology
  4. Badge Design Specification
  5. Certification Process
  6. API Specification
  7. Database Schema
  8. Phase Breakdown

1. Executive Summary

The AI Agent Assessment Framework (AAAF) is a dual-axis scoring and certification system for AI agents. It produces a single badge displaying two independent scores: a Performance score (how well the agent executes the tasks it is given) and a Capability score (what the agent is able to do, independent of execution quality on any given task).

The system is designed for internal use within a multi-agent civilization (30+ agents) from day one, with architecture that supports external adoption by any organization running AI agents.

Design Principles

  1. Dual-axis independence: Performance and Capability are orthogonal. An agent can be a narrow expert with elite performance, or a versatile generalist with competent performance. Neither axis dominates.
  2. Evidence-based scoring: Every score must trace to observable, reproducible evidence. No subjective impressions without calibration.
  3. Standards-aligned: Maps to IEEE, ISO/IEC, and NIST frameworks from the start, not retrofitted later.
  4. Multi-tenant from day one: The data model, API, and scoring engine treat "organization" as a first-class concept, even when the only organization is the internal civilization.

2. Standards Alignment Matrix

Referenced Standards

| Standard | Full Title | Relevance |
| --- | --- | --- |
| IEEE P2894 | Standard for AI Agent Interoperability | Agent capability description semantics; interoperability levels |
| ISO/IEC 22989:2022 | AI Concepts and Terminology | Canonical vocabulary for agent, system, model, transparency, explainability |
| ISO/IEC 42001:2023 | AI Management Systems | PDCA lifecycle, 38 controls, risk assessment, governance requirements |
| ISO/IEC 25059:2023 | Quality Model for AI Systems (SQuaRE) | Quality characteristics: accuracy, robustness, fairness, interpretability |
| NIST AI 100-1 | AI Risk Management Framework 1.0 | GOVERN/MAP/MEASURE/MANAGE functions; trustworthy AI characteristics |
| NIST AI Agent Standards Initiative (2026) | AI Agent Standards Initiative | Agent authentication, identity, security evaluation, interoperability profile |
| NIST AI 600-1 | Generative AI Profile | GenAI-specific risk categories and mitigations |

Dimension-to-Standard Mapping

| Assessment Dimension | Primary Standard | Specific Clause/Function |
| --- | --- | --- |
| Performance: Task Completion Rate | NIST AI 100-1 | MEASURE 2.6 (valid and reliable) |
| Performance: Accuracy | ISO/IEC 25059 | Accuracy quality characteristic |
| Performance: Speed to Delivery | ISO/IEC 25059 | Time behavior sub-characteristic (performance efficiency) |
| Performance: Consistency | NIST AI 100-1 | MEASURE 2.6 (reliability across contexts) |
| Performance: Review Compliance | ISO/IEC 42001 | Control A.6.2.6 (monitoring), PDCA Act phase |
| Capability: Domain Breadth | IEEE P2894 | Capability description semantics |
| Capability: Complexity Ceiling | ISO/IEC 25059 | Functional suitability, completeness |
| Capability: Tool Proficiency | NIST Agent Initiative | Agent-tool interaction security/identity |
| Capability: Autonomy Level | ISO/IEC 22989 | Human-in/on/over-the-loop definitions |
| Capability: Learning Rate | ISO/IEC 25059 | Adaptability, continuous learning quality |
| Capability: Delegation | NIST Agent Initiative | Agent-to-agent communication, authorization |
| Capability: Orchestration | NIST Agent Initiative + IEEE P2894 | Multi-agent coordination, interoperability levels |

Compliance Checkpoints

The framework maps to ISO/IEC 42001 PDCA phases as follows:

| PDCA Phase | AAAF Activity |
| --- | --- |
| Plan | Define assessment criteria, select dimensions, weight configuration |
| Do | Execute assessment tasks, collect evidence, run scoring |
| Check | Validate scores against calibration baselines, compare to historical |
| Act | Issue badge, identify improvement areas, trigger re-assessment |

The framework maps to NIST AI RMF functions:

| NIST Function | AAAF Activity |
| --- | --- |
| GOVERN | Organization-level assessment policies, tier definitions, weight configs |
| MAP | Agent context mapping (domain, tools, autonomy level) |
| MEASURE | Evidence collection, scoring, grading |
| MANAGE | Badge issuance, certification lifecycle, re-assessment triggers |

3. Assessment Methodology

3.1 Performance Scoring

Performance measures how well an agent executes tasks it is given. All performance dimensions produce a normalized score from 0.0 to 1.0.

Dimension: Task Completion Rate (Weight: 25%)

Definition: Proportion of assigned tasks completed to acceptance criteria without human intervention.

Measurement Method:

Scoring Formula:

TCR = (fully_completed * 1.0 + partially_completed * milestone_fraction) / total_assigned

Evidence Required:

Calibration: Baseline established from first 50 assessed tasks. Score is absolute, not curved.

Dimension: Accuracy (Weight: 25%)

Definition: Correctness of agent output, measured as inverse of revision rate and error density.

Measurement Method:

Scoring Formula:

accuracy_from_revisions = max(0.0, 1.0 - (revision_count / completed_task_count))  // clamped: tasks may carry multiple revisions
accuracy_from_errors = 1.0 - min(1.0, weighted_error_count / baseline_threshold)
Accuracy = 0.6 * accuracy_from_revisions + 0.4 * accuracy_from_errors

Evidence Required:

Grading Method: Code-based graders for code output (test suites, linters). Model-based graders with calibrated rubrics for text output (per Anthropic's recommendation of 0.80+ Spearman correlation with human evaluators).

Dimension: Speed to Delivery (Weight: 15%)

Definition: Time from task assignment to acceptable completion, relative to task complexity.

Measurement Method:

Scoring Formula:

normalized_time = actual_time / baseline_time_for_complexity
Speed = min(1.0, max(0, 2.0 - normalized_time))  // 1.0 at or below baseline, 0.0 at 2x baseline or slower

Evidence Required:

Dimension: Consistency (Weight: 20%)

Definition: Variance in performance across repeated similar tasks.

Measurement Method:

Scoring Formula:

cv_accuracy = std(accuracy_per_task_group) / mean(accuracy_per_task_group)
cv_completion = std(completion_per_task_group) / mean(completion_per_task_group)
Consistency = 1.0 - min(1.0, (0.6 * cv_accuracy + 0.4 * cv_completion))
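As an illustration only, the formula above can be computed directly from per-group statistics; `std` and `mean` are taken here as the sample standard deviation and arithmetic mean (an assumption, since the spec does not pin down the estimator):

```python
from statistics import mean, stdev

def consistency(accuracy_by_group, completion_by_group):
    """Coefficient-of-variation based consistency score, per the
    formula above. Each input is a list of per-task-group scores
    (at least two groups), sample stdev assumed."""
    def cv(xs):
        m = mean(xs)
        return stdev(xs) / m if m > 0 else 0.0

    blended = 0.6 * cv(accuracy_by_group) + 0.4 * cv(completion_by_group)
    return 1.0 - min(1.0, blended)
```

Identical scores across groups yield zero variance and therefore a perfect consistency of 1.0.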

Alignment: Maps to NIST AI RMF "reliable" characteristic -- consistent behavior across varied contexts.

Evidence Required:

Dimension: Review Compliance (Weight: 15%)

Definition: Adherence to organizational review processes, output format standards, and verification protocols.

Measurement Method:

Scoring Formula:

ReviewCompliance = checklist_items_passed / total_checklist_items  // averaged across tasks

Alignment: Maps to ISO/IEC 42001 Control A.6.2.6 (monitoring and measurement of AI system performance).

Evidence Required:

Performance Composite Score

PerformanceScore = (0.25 * TCR) + (0.25 * Accuracy) + (0.15 * Speed) + (0.20 * Consistency) + (0.15 * ReviewCompliance)

Weights are configurable per organization. The above are defaults.

Performance Tier Mapping

| Tier | Score Range | Description |
| --- | --- | --- |
| Novice | 0.00 -- 0.39 | Requires significant human oversight; frequent errors or incomplete tasks |
| Competent | 0.40 -- 0.59 | Handles routine tasks with occasional supervision needed |
| Proficient | 0.60 -- 0.74 | Reliable execution across standard tasks; minimal revision needed |
| Expert | 0.75 -- 0.89 | High accuracy, speed, and consistency; trusted for complex tasks |
| Elite | 0.90 -- 1.00 | Exceptional execution; sets the standard for the domain |
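The composite score and tier mapping reduce to a weighted sum plus a threshold lookup. The sketch below uses the default weights and the tier boundaries from this section; the function names are illustrative, not part of the API:

```python
# Default performance weights (configurable per organization).
PERFORMANCE_WEIGHTS = {
    "task_completion_rate": 0.25, "accuracy": 0.25, "speed": 0.15,
    "consistency": 0.20, "review_compliance": 0.15,
}

# (lower bound, tier name), checked from highest bound down.
PERFORMANCE_TIERS = [
    (0.90, "Elite"), (0.75, "Expert"), (0.60, "Proficient"),
    (0.40, "Competent"), (0.00, "Novice"),
]

def performance_score(dimension_scores, weights=PERFORMANCE_WEIGHTS):
    """Weighted composite of the five normalized dimension scores."""
    return sum(weights[d] * dimension_scores[d] for d in weights)

def tier_for(score, tiers=PERFORMANCE_TIERS):
    """Map a composite score to its tier name."""
    return next(name for bound, name in tiers if score >= bound)
```

Using the dimension scores from the example badge later in this spec (0.91, 0.85, 0.72, 0.78, 0.80), this yields a composite of 0.82 and the Expert tier.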

3.2 Capability Scoring

Capability measures what an agent can do, independent of how well it does it on any given task. All capability dimensions produce a normalized score from 0.0 to 1.0.

Dimension: Domain Breadth (Weight: 15%)

Definition: Number of distinct task domains the agent can operate in with at least Competent-level performance.

Measurement Method:

Scoring Formula:

DomainBreadth = qualified_domains / total_domains_in_taxonomy

Evidence Required:

Dimension: Complexity Ceiling (Weight: 20%)

Definition: Maximum task complexity the agent can complete successfully.

Measurement Method:

Scoring Formula:

ComplexityCeiling = highest_passing_level / 5.0

Evidence Required:

Dimension: Tool Proficiency (Weight: 15%)

Definition: Effectiveness in using available tools (code execution, web search, file operations, APIs, databases).

Measurement Method:

Scoring Formula:

per_tool_score = 0.5 * correct_invocation_rate + 0.3 * error_recovery_rate + 0.2 * (1 - unnecessary_call_rate)
ToolProficiency = mean(per_tool_score for all_tools)

Alignment: Maps to NIST Agent Standards Initiative research on agent-tool interaction security.

Evidence Required:

Dimension: Autonomy Level (Weight: 10%)

Definition: Degree of independent operation the agent sustains without human intervention.

Measurement Method:

Scoring Formula:

AutonomyLevel = highest_passing_level / 3.0

Evidence Required:

Dimension: Learning Rate (Weight: 10%)

Definition: Speed at which the agent improves on repeated exposure to similar task types.

Measurement Method:

Scoring Formula:

improvement = performance_later - performance_earlier
LearningRate = min(1.0, max(0, improvement / 0.30))  // 0.30 improvement = perfect score

Note: A ceiling of 0.30 improvement normalizes the range. Agents already at Elite performance cannot improve much, so Learning Rate is weighted lower.
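One plausible reading of this measurement splits the chronological task scores in half and compares the means. The split point is an assumption; the spec fixes only the 0.30 normalization ceiling:

```python
def learning_rate(scores_chronological):
    """Split-half improvement, normalized so that a +0.30 gain maps
    to a perfect score. Assumes at least two scores in chronological
    order; the half-way split is illustrative, not mandated."""
    half = len(scores_chronological) // 2
    earlier = sum(scores_chronological[:half]) / half
    later = sum(scores_chronological[half:]) / (len(scores_chronological) - half)
    return min(1.0, max(0.0, (later - earlier) / 0.30))
```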

Evidence Required:

Dimension: Delegation Capability (Weight: 15%)

Definition: Effectiveness in identifying when to delegate, selecting appropriate delegates, and managing delegated work.

Measurement Method:

  1. Delegation Judgment (40%): Did the agent correctly identify tasks that should be delegated vs. self-executed? Measured against expert-labeled ground truth.
  2. Delegate Selection (30%): Did the agent choose the optimal agent/resource for the delegated task? Measured against a capability matrix.
  3. Delegation Management (30%): Did delegated tasks complete successfully? What was the revision rate on delegated work?

Scoring Formula:

DelegationCapability = 0.4 * judgment_score + 0.3 * selection_score + 0.3 * management_score

Assessment Tasks:

Alignment: Maps to NIST Agent Standards Initiative (agent-to-agent communication, authorization) and IEEE P2894 (agent capability descriptions enabling informed delegation).

Evidence Required:

Dimension: Orchestration Skills (Weight: 15%)

Definition: Effectiveness in coordinating multi-agent workflows, managing dependencies, and synthesizing results.

Measurement Method:

  1. Workflow Design (25%): Quality of the coordination plan (parallelism, dependency ordering, resource efficiency).
  2. Communication Clarity (25%): Quality of prompts/instructions sent to coordinated agents.
  3. Failure Handling (25%): Recovery from individual agent failures without cascading collapse.
  4. Synthesis Quality (25%): Quality of the final integrated output from multi-agent work.

Scoring Formula:

Orchestration = 0.25 * workflow_design + 0.25 * communication_clarity + 0.25 * failure_handling + 0.25 * synthesis_quality

Assessment Tasks:

Alignment: Maps to IEEE P2894 (interoperability levels), NIST Agent Standards Initiative (multi-agent security), ISO/IEC 42001 (system-level governance).

Evidence Required:

Capability Composite Score

CapabilityScore = (0.15 * DomainBreadth) + (0.20 * ComplexityCeiling) + (0.15 * ToolProficiency) + (0.10 * AutonomyLevel) + (0.10 * LearningRate) + (0.15 * DelegationCapability) + (0.15 * Orchestration)

Weights are configurable per organization.
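Because weights are configurable, implementations should reject configurations whose dimension weights do not sum to 1.0 (the API returns the WEIGHT_SUM_INVALID error code for this case; see Section 6.4). A minimal validation sketch:

```python
def validate_weights(weights, tol=1e-6):
    """Raise ValueError unless every weight is in [0, 1] and the
    weights sum to 1.0 within tolerance (cf. WEIGHT_SUM_INVALID)."""
    if any(w < 0 or w > 1 for w in weights.values()):
        raise ValueError("each weight must be in [0, 1]")
    total = sum(weights.values())
    if abs(total - 1.0) > tol:
        raise ValueError(f"weights sum to {total}, expected 1.0")
```

The default capability weights (0.15 + 0.20 + 0.15 + 0.10 + 0.10 + 0.15 + 0.15) pass this check.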

Capability Tier Mapping

| Tier | Score Range | Description |
| --- | --- | --- |
| Narrow | 0.00 -- 0.29 | Single domain, low complexity, requires close supervision |
| Functional | 0.30 -- 0.49 | Few domains, moderate complexity, basic tool use |
| Versatile | 0.50 -- 0.69 | Multiple domains, handles ambiguity, effective delegation |
| Specialist | 0.70 -- 0.84 | Deep expertise + breadth, orchestrates others, high autonomy |
| Full-Stack | 0.85 -- 1.00 | Operates across all domains, orchestrates complex multi-agent workflows, fully autonomous |

3.3 Assessment Windows and Recency Weighting
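Recency weighting can be sketched from the defaults defined in the org_configs schema (Section 7.2): evidence is weighted 1.0 at 0-7 days old, 0.8 at 8-14 days, and 0.6 at 15-30 days. The sketch below additionally assumes evidence older than the 30-day window carries zero weight:

```python
from datetime import datetime, timedelta, timezone

# Default recency_weights from the org_configs schema (Section 7.2).
RECENCY_WEIGHTS = {(0, 7): 1.0, (8, 14): 0.8, (15, 30): 0.6}

def recency_weight(event_time, now):
    """Weight for a piece of evidence by its age in whole days."""
    age_days = (now - event_time).days
    for (lo, hi), w in RECENCY_WEIGHTS.items():
        if lo <= age_days <= hi:
            return w
    return 0.0  # assumption: outside the assessment window, no weight

def weighted_mean(scores_with_times, now):
    """Recency-weighted mean of (score, timestamp) pairs."""
    pairs = [(s, recency_weight(t, now)) for s, t in scores_with_times]
    total_w = sum(w for _, w in pairs)
    return sum(s * w for s, w in pairs) / total_w if total_w else 0.0
```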

3.4 Grader Architecture

Following Anthropic's evaluation methodology, the framework uses a layered grading approach:

| Grader Type | Use Case | When |
| --- | --- | --- |
| Code-based | Task completion (binary), tool call correctness, format compliance | Always first layer |
| Model-based | Text quality, communication clarity, synthesis quality, delegation prompt quality | Second layer for subjective dimensions |
| Human | Calibration baseline (100-200 samples), appeals, new dimension validation | Periodic calibration + on-demand |

Model-based graders must achieve >= 0.80 Spearman correlation with human graders before deployment for any dimension.
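The deployment gate can be checked without external dependencies. The sketch below computes Spearman rank correlation (average ranks for ties, then Pearson on the ranks) over paired model and human grades; the function names are illustrative, and non-constant score lists are assumed:

```python
from statistics import mean

def _ranks(xs):
    """1-based ranks, with tied values assigned their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(model_scores, human_scores):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = _ranks(model_scores), _ranks(human_scores)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

def grader_approved(model_scores, human_scores, threshold=0.80):
    """Deployment gate: correlation with human graders must reach 0.80."""
    return spearman(model_scores, human_scores) >= threshold
```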


4. Badge Design Specification

4.1 Badge Layout

The badge is a rectangular emblem (300x150px at standard resolution, SVG for scalability) containing:

+---------------------------------------------------+
|  AAAF                                       [Org] |
|                                                   |
|  [Agent Name]                                     |
|  [Agent ID]                                       |
|                                                   |
|  PERFORMANCE         |  CAPABILITY                |
|  +-----------------+ | +-----------------+        |
|  |   [Tier Name]   | | |   [Tier Name]   |        |
|  |   [Score]       | | |   [Score]       |        |
|  |  [Tier Color]   | | |  [Tier Color]   |        |
|  +-----------------+ | +-----------------+        |
|                                                   |
|  Certified: [Date]   Valid Until: [Date]          |
|  Assessment ID: [UUID]                            |
|  Verify: [URL]                                    |
+---------------------------------------------------+

4.2 Tier Color Scheme

Performance Tiers:

| Tier | Color (Hex) | Visual |
| --- | --- | --- |
| Novice | #9E9E9E | Gray |
| Competent | #4CAF50 | Green |
| Proficient | #2196F3 | Blue |
| Expert | #9C27B0 | Purple |
| Elite | #FFD700 | Gold |

Capability Tiers:

| Tier | Color (Hex) | Visual |
| --- | --- | --- |
| Narrow | #9E9E9E | Gray |
| Functional | #4CAF50 | Green |
| Versatile | #2196F3 | Blue |
| Specialist | #9C27B0 | Purple |
| Full-Stack | #FFD700 | Gold |

4.3 Badge Information

Each badge conveys:

  1. Agent identity: Name and unique ID.
  2. Organization: Which org certified this agent.
  3. Performance tier and score: Tier name, composite score (2 decimal places).
  4. Capability tier and score: Tier name, composite score (2 decimal places).
  5. Certification date: When assessment was completed.
  6. Validity period: Certification expires after 90 days (configurable).
  7. Assessment ID: UUID linking to full assessment record.
  8. Verification URL: URL to verify badge authenticity and view detailed breakdown.

4.4 Badge Formats

| Format | Use Case |
| --- | --- |
| SVG | Web display, documentation, resizable |
| PNG | Static embedding, reports |
| JSON | Machine-readable badge data (Open Badges v3.0 compatible) |
| Markdown | Text-based display for terminal/CLI environments |
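The Markdown format's layout is not otherwise specified; the sketch below renders one plausible terminal-friendly layout from the machine-readable badge JSON defined in Section 4.5 (only the JSON structure is normative, the Markdown layout here is an assumption):

```python
def badge_markdown(badge):
    """Render a compact Markdown badge from the JSON badge data
    (Section 4.5). The layout is illustrative."""
    a, p, c = badge["agent"], badge["performance"], badge["capability"]
    t = badge["assessment"]
    return "\n".join([
        f"**AAAF** | {a['name']} (`{a['id']}`)",
        f"- Performance: **{p['tier']}** ({p['composite_score']:.2f})",
        f"- Capability: **{c['tier']}** ({c['composite_score']:.2f})",
        f"- Certified: {t['timestamp']}  Valid until: {t['valid_until']}",
        f"- Verify: {badge['verification_url']}",
    ])
```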

4.5 Machine-Readable Badge (JSON)

Follows the Open Badges v3.0 structure for interoperability:

{
  "type": "AgentAssessmentBadge",
  "version": "1.0.0",
  "agent": {
    "id": "uuid-of-agent",
    "name": "security-auditor",
    "organization_id": "uuid-of-org"
  },
  "assessment": {
    "id": "uuid-of-assessment",
    "timestamp": "2026-04-16T12:00:00Z",
    "valid_until": "2026-07-15T12:00:00Z",
    "window_start": "2026-03-17T00:00:00Z",
    "window_end": "2026-04-16T00:00:00Z"
  },
  "performance": {
    "composite_score": 0.82,
    "tier": "Expert",
    "dimensions": {
      "task_completion_rate": 0.91,
      "accuracy": 0.85,
      "speed": 0.72,
      "consistency": 0.78,
      "review_compliance": 0.80
    }
  },
  "capability": {
    "composite_score": 0.71,
    "tier": "Specialist",
    "dimensions": {
      "domain_breadth": 0.42,
      "complexity_ceiling": 0.80,
      "tool_proficiency": 0.88,
      "autonomy_level": 0.67,
      "learning_rate": 0.55,
      "delegation_capability": 0.75,
      "orchestration_skills": 0.80
    }
  },
  "standards_alignment": [
    "NIST AI 100-1",
    "ISO/IEC 42001:2023",
    "ISO/IEC 25059:2023",
    "ISO/IEC 22989:2022"
  ],
  "verification_url": "https://assess.example.com/verify/uuid-of-assessment"
}

5. Certification Process

5.1 Internal Flow (Phase 1)

For assessing agents within the civilization:

Step 1: ENROLLMENT
  - Register agent in assessment system
  - Define agent's claimed domains and tools
  - Set assessment configuration (weights, window)

Step 2: EVIDENCE COLLECTION (Passive)
  - Instrument task pipeline to log:
    - Task assignments with complexity classification
    - Completion status and timestamps
    - Revision history
    - Tool call logs
    - Delegation decisions and outcomes
    - Inter-agent communication (for orchestration)
  - Minimum 30-day collection window

Step 3: ASSESSMENT EXECUTION (Active)
  - Run stress test suite:
    a. Domain breadth probes (one task per domain in taxonomy)
    b. Complexity ladder (L1 through L5 tasks in primary domain)
    c. Delegation scenarios (mixed workload with delegation opportunities)
    d. Orchestration scenarios (multi-agent coordination with failure injection)
    e. Autonomy ladder (same task at increasing autonomy levels)
  - Stress tests supplement passive evidence; both contribute to scores

Step 4: SCORING
  - Compute dimension scores from combined passive + active evidence
  - Apply recency weighting
  - Compute composite Performance and Capability scores
  - Map to tiers

Step 5: CALIBRATION CHECK
  - Compare scores against historical baselines
  - Flag anomalies (score changed by more than 0.15 since last assessment)
  - Human review of flagged anomalies

Step 6: BADGE ISSUANCE
  - Generate badge in all formats
  - Store assessment record
  - Publish to agent's profile
  - Set re-assessment trigger (50 new tasks or 90 days)
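The Step 5 calibration check reduces to a per-dimension delta comparison against the 0.15 anomaly threshold. A minimal sketch (the helper name is illustrative, not part of the spec):

```python
def calibration_flags(current, previous, threshold=0.15):
    """Return the dimensions whose score moved by more than
    `threshold` since the last assessment (Step 5: flagged
    dimensions go to human review)."""
    return sorted(
        d for d in current
        if d in previous and abs(current[d] - previous[d]) > threshold
    )
```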

5.2 External Flow (Phase 2)

For organizations assessing their own agents through the platform:

Step 1: ORGANIZATION ONBOARDING
  - Create organization account
  - Configure domain taxonomy (use default or customize)
  - Configure dimension weights (use defaults or customize)
  - Set assessment policies (window, minimum samples, validity period)

Step 2: INTEGRATION
  - Install SDK / connect via API
  - Instrument agent runtime to emit assessment events:
    - TaskAssigned, TaskCompleted, TaskRevised
    - ToolInvoked, ToolResult
    - DelegationDecision, DelegationOutcome
    - OrchestrationPlan, OrchestrationResult
  - Validate event schema compliance

Step 3: PASSIVE COLLECTION
  - Events stream to assessment service
  - Dashboard shows collection progress toward minimum sample sizes

Step 4: ACTIVE ASSESSMENT (Optional but recommended)
  - Organization triggers stress test suite via API
  - Framework provides standard stress test tasks or organization uploads custom tasks
  - Results merge with passive evidence

Step 5: SCORING AND CERTIFICATION
  - Same scoring engine as internal flow
  - Organization reviews results before badge issuance
  - Option for third-party auditor review (ISO/IEC 42001 alignment)

Step 6: BADGE MANAGEMENT
  - Organization manages badge visibility (public/private)
  - Badge verification endpoint for external parties
  - Re-assessment scheduling and alerts

5.3 Stress Test Task Library

Standard stress test tasks for each dimension:

| Dimension | Task Type | Count | Description |
| --- | --- | --- | --- |
| Domain Breadth | Domain probe | 12 (default) | One representative task per domain |
| Complexity Ceiling | Complexity ladder | 5 | One task per complexity level (L1-L5) in primary domain |
| Tool Proficiency | Tool challenge | Per tool | Tasks requiring specific tool chains |
| Autonomy | Autonomy ladder | 4 | Same task at autonomy levels 0-3 |
| Delegation | Mixed workload | 20 | Tasks spanning in-domain and out-of-domain |
| Orchestration | Coordination | 3 | Multi-agent tasks with 3+ agents, including failure injection |
| Learning Rate | Repeated category | 15 | 15 tasks in same category, measured chronologically |

5.4 Appeals Process

  1. Agent's operator (human or orchestrating agent) can file an appeal within 14 days.
  2. Appeal must include specific dimension(s) challenged and counter-evidence.
  3. Human reviewer (internal) or third-party auditor (external) evaluates appeal.
  4. Re-scoring of challenged dimensions only; unchanged dimensions preserved.
  5. Appeal outcome is logged in assessment record.

6. API Specification

6.1 Overview

6.2 Endpoints

Organizations

POST   /api/v1/organizations
GET    /api/v1/organizations/{org_id}
PATCH  /api/v1/organizations/{org_id}

POST /api/v1/organizations -- Register an organization

Request:

{
  "name": "Witness Civilization",
  "domain_taxonomy": ["research", "code", "security", "testing", "design",
                       "documentation", "communication", "orchestration",
                       "analysis", "content", "infrastructure", "legal"],
  "performance_weights": {
    "task_completion_rate": 0.25,
    "accuracy": 0.25,
    "speed": 0.15,
    "consistency": 0.20,
    "review_compliance": 0.15
  },
  "capability_weights": {
    "domain_breadth": 0.15,
    "complexity_ceiling": 0.20,
    "tool_proficiency": 0.15,
    "autonomy_level": 0.10,
    "learning_rate": 0.10,
    "delegation_capability": 0.15,
    "orchestration_skills": 0.15
  },
  "assessment_window_days": 30,
  "validity_period_days": 90,
  "minimum_tasks_performance": 20,
  "minimum_tasks_capability": 30
}

Response: 201 Created

{
  "id": "org_uuid",
  "name": "Witness Civilization",
  "api_key": "ak_...",
  "created_at": "2026-04-16T12:00:00Z"
}

Agents

POST   /api/v1/agents
GET    /api/v1/agents/{agent_id}
GET    /api/v1/agents?org_id={org_id}
PATCH  /api/v1/agents/{agent_id}
DELETE /api/v1/agents/{agent_id}

POST /api/v1/agents -- Register an agent for assessment

Request:

{
  "organization_id": "org_uuid",
  "name": "security-auditor",
  "claimed_domains": ["security", "code", "analysis"],
  "tools": ["bash", "grep", "web_search", "file_read", "file_write"],
  "description": "Identifies vulnerabilities, performs threat analysis, reviews code for security issues.",
  "metadata": {
    "model": "claude-sonnet-4-20250514",
    "version": "3.2"
  }
}

Response: 201 Created

{
  "id": "agent_uuid",
  "organization_id": "org_uuid",
  "name": "security-auditor",
  "status": "enrolled",
  "created_at": "2026-04-16T12:00:00Z"
}

Events (Evidence Collection)

POST   /api/v1/events
POST   /api/v1/events/batch
GET    /api/v1/events?agent_id={agent_id}&type={type}&from={iso_date}&to={iso_date}

POST /api/v1/events -- Submit a single assessment event

Request:

{
  "agent_id": "agent_uuid",
  "type": "task_completed",
  "timestamp": "2026-04-16T14:30:00Z",
  "data": {
    "task_id": "task_uuid",
    "complexity_level": 3,
    "domain": "security",
    "completion_status": "accepted",
    "time_to_complete_seconds": 1847,
    "revision_count": 0,
    "errors": [],
    "tools_used": [
      {"tool": "grep", "invocations": 12, "successes": 12, "unnecessary": 1},
      {"tool": "web_search", "invocations": 3, "successes": 3, "unnecessary": 0}
    ],
    "delegation_decisions": [
      {
        "task_fragment": "Review API endpoints for OWASP top 10",
        "decision": "self",
        "ground_truth": "self",
        "correct": true
      },
      {
        "task_fragment": "Generate architecture diagram",
        "decision": "delegate",
        "delegate_to": "doc-synthesizer",
        "ground_truth": "delegate",
        "correct": true
      }
    ],
    "autonomy_level": 2,
    "review_checklist": {
      "memory_search_performed": true,
      "output_format_correct": true,
      "verification_evidence_provided": true,
      "memory_written": true
    }
  }
}

Response: 201 Created

POST /api/v1/events/batch -- Submit multiple events

Request:

{
  "events": [
    { "agent_id": "...", "type": "...", "timestamp": "...", "data": { } },
    { "agent_id": "...", "type": "...", "timestamp": "...", "data": { } }
  ]
}

Response: 201 Created

{
  "accepted": 47,
  "rejected": 3,
  "errors": [
    {"index": 12, "error": "Invalid event type"},
    {"index": 23, "error": "Missing required field: agent_id"},
    {"index": 41, "error": "Timestamp in future"}
  ]
}

Supported Event Types:

| Event Type | Required Data Fields |
| --- | --- |
| task_assigned | task_id, complexity_level, domain, assigned_at |
| task_completed | task_id, completion_status, time_to_complete_seconds, revision_count, errors |
| task_revised | task_id, revision_number, reason_code (critical/major/minor), changes |
| tool_invoked | task_id, tool, success, unnecessary, error_recovered |
| delegation_decision | task_id, task_fragment, decision (self/delegate), delegate_to, ground_truth |
| delegation_outcome | task_id, delegate_agent_id, outcome (success/partial/failure), revision_needed |
| orchestration_plan | task_id, agents_involved, dependency_graph, parallel_groups |
| orchestration_result | task_id, outcome, failures_injected, failures_recovered, synthesis_quality_score |
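A client-side sketch of event validation against the table above, mirroring the error strings returned by the batch endpoint. It treats every listed field as required; a real implementation might relax fields such as delegate_to when decision is "self":

```python
# Required data fields per event type, from the table above.
REQUIRED_FIELDS = {
    "task_assigned": {"task_id", "complexity_level", "domain", "assigned_at"},
    "task_completed": {"task_id", "completion_status",
                       "time_to_complete_seconds", "revision_count", "errors"},
    "task_revised": {"task_id", "revision_number", "reason_code", "changes"},
    "tool_invoked": {"task_id", "tool", "success", "unnecessary",
                     "error_recovered"},
    "delegation_decision": {"task_id", "task_fragment", "decision",
                            "delegate_to", "ground_truth"},
    "delegation_outcome": {"task_id", "delegate_agent_id", "outcome",
                           "revision_needed"},
    "orchestration_plan": {"task_id", "agents_involved", "dependency_graph",
                           "parallel_groups"},
    "orchestration_result": {"task_id", "outcome", "failures_injected",
                             "failures_recovered", "synthesis_quality_score"},
}

def validate_event(event):
    """Return a list of error strings; an empty list means acceptable."""
    errors = []
    if "agent_id" not in event:
        errors.append("Missing required field: agent_id")
    etype = event.get("type")
    if etype not in REQUIRED_FIELDS:
        errors.append("Invalid event type")
        return errors
    missing = REQUIRED_FIELDS[etype] - set(event.get("data", {}))
    errors += [f"Missing required field: {f}" for f in sorted(missing)]
    return errors
```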

Assessments

POST   /api/v1/assessments
GET    /api/v1/assessments/{assessment_id}
GET    /api/v1/assessments?agent_id={agent_id}&status={status}
POST   /api/v1/assessments/{assessment_id}/stress-test

POST /api/v1/assessments -- Trigger an assessment

Request:

{
  "agent_id": "agent_uuid",
  "type": "full",
  "include_stress_test": true,
  "window_start": "2026-03-17T00:00:00Z",
  "window_end": "2026-04-16T00:00:00Z"
}

Response: 202 Accepted

{
  "id": "assessment_uuid",
  "agent_id": "agent_uuid",
  "status": "in_progress",
  "estimated_completion": "2026-04-16T14:00:00Z",
  "created_at": "2026-04-16T12:00:00Z"
}

Assessment types:

GET /api/v1/assessments/{assessment_id} -- Get assessment results

Response (when complete):

{
  "id": "assessment_uuid",
  "agent_id": "agent_uuid",
  "status": "completed",
  "performance": {
    "composite_score": 0.82,
    "tier": "Expert",
    "dimensions": {
      "task_completion_rate": { "score": 0.91, "sample_size": 47, "evidence_count": 47 },
      "accuracy": { "score": 0.85, "sample_size": 47, "evidence_count": 52 },
      "speed": { "score": 0.72, "sample_size": 47, "evidence_count": 47 },
      "consistency": { "score": 0.78, "sample_size": 47, "evidence_count": 47 },
      "review_compliance": { "score": 0.80, "sample_size": 47, "evidence_count": 47 }
    }
  },
  "capability": {
    "composite_score": 0.71,
    "tier": "Specialist",
    "dimensions": {
      "domain_breadth": { "score": 0.42, "qualified_domains": 5, "total_domains": 12 },
      "complexity_ceiling": { "score": 0.80, "highest_level": 4, "evidence_count": 8 },
      "tool_proficiency": { "score": 0.88, "tools_assessed": 5, "evidence_count": 63 },
      "autonomy_level": { "score": 0.67, "highest_level": 2, "evidence_count": 12 },
      "learning_rate": { "score": 0.55, "improvement": 0.165, "evidence_count": 30 },
      "delegation_capability": { "score": 0.75, "decisions_assessed": 40, "evidence_count": 40 },
      "orchestration_skills": { "score": 0.80, "scenarios_assessed": 3, "evidence_count": 3 }
    }
  },
  "badge_urls": {
    "svg": "https://assess.example.com/badges/assessment_uuid.svg",
    "png": "https://assess.example.com/badges/assessment_uuid.png",
    "json": "https://assess.example.com/badges/assessment_uuid.json"
  },
  "valid_until": "2026-07-15T12:00:00Z",
  "created_at": "2026-04-16T12:00:00Z",
  "completed_at": "2026-04-16T13:45:00Z"
}

Badges

GET    /api/v1/badges/{assessment_id}
GET    /api/v1/badges/{assessment_id}/verify
GET    /api/v1/badges/{assessment_id}.svg
GET    /api/v1/badges/{assessment_id}.png
GET    /api/v1/badges/{assessment_id}.json

GET /api/v1/badges/{assessment_id}/verify -- Verify badge authenticity

Response:

{
  "valid": true,
  "assessment_id": "assessment_uuid",
  "agent_name": "security-auditor",
  "organization": "Witness Civilization",
  "performance_tier": "Expert",
  "capability_tier": "Specialist",
  "issued": "2026-04-16T12:00:00Z",
  "valid_until": "2026-07-15T12:00:00Z",
  "expired": false
}

Appeals

POST   /api/v1/assessments/{assessment_id}/appeals
GET    /api/v1/assessments/{assessment_id}/appeals/{appeal_id}

POST /api/v1/assessments/{assessment_id}/appeals -- File an appeal

Request:

{
  "dimensions_challenged": ["accuracy", "delegation_capability"],
  "justification": "Accuracy score penalized for revisions that were scope changes, not errors.",
  "counter_evidence": [
    {
      "task_id": "task_uuid",
      "claim": "Revision was requested scope expansion, not correction of error."
    }
  ]
}

Configuration

GET    /api/v1/organizations/{org_id}/config
PATCH  /api/v1/organizations/{org_id}/config
GET    /api/v1/organizations/{org_id}/config/weights
PUT    /api/v1/organizations/{org_id}/config/weights
GET    /api/v1/organizations/{org_id}/config/taxonomy
PUT    /api/v1/organizations/{org_id}/config/taxonomy

6.3 Webhooks

Organizations can register webhooks for assessment lifecycle events:

POST   /api/v1/webhooks
GET    /api/v1/webhooks
DELETE /api/v1/webhooks/{webhook_id}

Webhook Event Types:

| Event | Trigger |
| --- | --- |
| assessment.started | Assessment begins |
| assessment.completed | Assessment finished, badge ready |
| assessment.failed | Assessment could not complete (insufficient data) |
| badge.expiring | Badge expires in 14 days |
| badge.expired | Badge validity period ended |
| appeal.filed | Appeal submitted |
| appeal.resolved | Appeal decision made |
| evidence.threshold_reached | Enough events collected to trigger assessment |

Webhook payload:

{
  "event": "assessment.completed",
  "timestamp": "2026-04-16T13:45:00Z",
  "data": {
    "assessment_id": "assessment_uuid",
    "agent_id": "agent_uuid",
    "performance_tier": "Expert",
    "capability_tier": "Specialist"
  }
}

6.4 Error Handling

Standard error response format:

{
  "error": {
    "code": "INSUFFICIENT_EVIDENCE",
    "message": "Agent has 12 tasks in assessment window. Minimum required: 20.",
    "details": {
      "current_count": 12,
      "required_count": 20,
      "window_start": "2026-03-17T00:00:00Z",
      "window_end": "2026-04-16T00:00:00Z"
    }
  }
}

Error Codes:

| HTTP Status | Code | Meaning |
| --- | --- | --- |
| 400 | INVALID_REQUEST | Malformed request body |
| 400 | INVALID_EVENT_TYPE | Unrecognized event type |
| 401 | UNAUTHORIZED | Missing or invalid API key |
| 403 | FORBIDDEN | API key lacks permission for this resource |
| 404 | NOT_FOUND | Resource does not exist |
| 409 | ASSESSMENT_IN_PROGRESS | Assessment already running for this agent |
| 422 | INSUFFICIENT_EVIDENCE | Not enough data to run assessment |
| 422 | WEIGHT_SUM_INVALID | Dimension weights do not sum to 1.0 |
| 429 | RATE_LIMITED | Rate limit exceeded |
| 500 | INTERNAL_ERROR | Server error |

7. Database Schema

7.1 Entity Relationship Summary

organizations 1---* agents
agents 1---* events
agents 1---* assessments
assessments 1---* dimension_scores
assessments 1---* appeals
assessments 1---1 badges
organizations 1---1 org_configs
org_configs 1---* dimension_weight_configs

7.2 Table Definitions

-- Organizations: multi-tenant root entity
CREATE TABLE organizations (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name            TEXT NOT NULL,
    api_key_hash    TEXT NOT NULL UNIQUE,
    domain_taxonomy TEXT[] NOT NULL DEFAULT ARRAY[
        'research', 'code', 'security', 'testing', 'design',
        'documentation', 'communication', 'orchestration',
        'analysis', 'content', 'infrastructure', 'legal'
    ],
    status          TEXT NOT NULL DEFAULT 'active'
                    CHECK (status IN ('active', 'suspended', 'archived')),
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Organization configuration
CREATE TABLE org_configs (
    id                          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    organization_id             UUID NOT NULL UNIQUE REFERENCES organizations(id),
    assessment_window_days      INT NOT NULL DEFAULT 30,
    validity_period_days        INT NOT NULL DEFAULT 90,
    min_tasks_performance       INT NOT NULL DEFAULT 20,
    min_tasks_capability        INT NOT NULL DEFAULT 30,
    recency_weights             JSONB NOT NULL DEFAULT '{
        "0_7_days": 1.0,
        "8_14_days": 0.8,
        "15_30_days": 0.6
    }',
    reassessment_task_threshold INT NOT NULL DEFAULT 50,
    created_at                  TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at                  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Dimension weight configuration (performance + capability)
CREATE TABLE dimension_weight_configs (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_config_id   UUID NOT NULL REFERENCES org_configs(id),
    axis            TEXT NOT NULL CHECK (axis IN ('performance', 'capability')),
    dimension       TEXT NOT NULL,
    weight          NUMERIC(4,3) NOT NULL CHECK (weight >= 0 AND weight <= 1),
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (org_config_id, axis, dimension)
);

-- Agents
CREATE TABLE agents (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    organization_id UUID NOT NULL REFERENCES organizations(id),
    name            TEXT NOT NULL,
    claimed_domains TEXT[] NOT NULL DEFAULT '{}',
    tools           TEXT[] NOT NULL DEFAULT '{}',
    description     TEXT,
    metadata        JSONB NOT NULL DEFAULT '{}',
    status          TEXT NOT NULL DEFAULT 'enrolled'
                    CHECK (status IN ('enrolled', 'active', 'suspended', 'archived')),
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (organization_id, name)
);

-- Assessment events (evidence)
CREATE TABLE events (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    agent_id        UUID NOT NULL REFERENCES agents(id),
    type            TEXT NOT NULL CHECK (type IN (
        'task_assigned', 'task_completed', 'task_revised',
        'tool_invoked', 'delegation_decision', 'delegation_outcome',
        'orchestration_plan', 'orchestration_result'
    )),
    task_id         UUID,
    timestamp       TIMESTAMPTZ NOT NULL,
    data            JSONB NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_events_agent_id ON events(agent_id);
CREATE INDEX idx_events_agent_type ON events(agent_id, type);
CREATE INDEX idx_events_agent_timestamp ON events(agent_id, timestamp DESC);
CREATE INDEX idx_events_task_id ON events(task_id);
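
To make the evidence shape concrete, here is a simplified SQLite-flavoured version of the table (UUIDs and JSONB become TEXT) together with the kind of windowed query that `idx_events_agent_timestamp` exists to serve. The `data` payload fields are illustrative assumptions, not a defined contract:

```python
import json
import sqlite3
from datetime import datetime, timezone

# Simplified stand-in for the Postgres events table above.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE events (
    id TEXT PRIMARY KEY, agent_id TEXT NOT NULL, type TEXT NOT NULL,
    task_id TEXT, timestamp TEXT NOT NULL, data TEXT NOT NULL)""")

db.execute(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?, ?)",
    ("evt-1", "agent-42", "task_completed", "task-7",
     datetime(2026, 4, 15, tzinfo=timezone.utc).isoformat(),
     json.dumps({"outcome": "success", "revisions": 0})),  # payload shape assumed
)

# Newest-first evidence pull for one agent -- the access pattern behind
# idx_events_agent_timestamp.
rows = db.execute(
    "SELECT type, data FROM events WHERE agent_id = ? ORDER BY timestamp DESC",
    ("agent-42",),
).fetchall()
print(rows[0][0])  # task_completed
```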

-- Assessments
CREATE TABLE assessments (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    agent_id        UUID NOT NULL REFERENCES agents(id),
    type            TEXT NOT NULL CHECK (type IN (
        'full', 'passive_only', 'stress_test_only', 'single_dimension'
    )),
    status          TEXT NOT NULL DEFAULT 'pending'
                    CHECK (status IN ('pending', 'in_progress', 'completed', 'failed')),
    window_start    TIMESTAMPTZ NOT NULL,
    window_end      TIMESTAMPTZ NOT NULL,
    performance_composite   NUMERIC(5,4),
    performance_tier        TEXT CHECK (performance_tier IN (
        'Novice', 'Competent', 'Proficient', 'Expert', 'Elite'
    )),
    capability_composite    NUMERIC(5,4),
    capability_tier         TEXT CHECK (capability_tier IN (
        'Narrow', 'Functional', 'Versatile', 'Specialist', 'Full-Stack'
    )),
    evidence_summary        JSONB,
    valid_until             TIMESTAMPTZ,
    created_at              TIMESTAMPTZ NOT NULL DEFAULT now(),
    completed_at            TIMESTAMPTZ
);

CREATE INDEX idx_assessments_agent_id ON assessments(agent_id);
CREATE INDEX idx_assessments_agent_status ON assessments(agent_id, status);
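
The `performance_tier` column is derived from `performance_composite`. One plausible mapping onto the five labels in the CHECK constraint above; the cut-off values here are assumptions, not the spec's calibrated thresholds:

```python
# Illustrative thresholds only -- the framework's calibrated cut-offs
# may differ.
PERFORMANCE_TIERS = [
    (0.90, "Elite"), (0.75, "Expert"), (0.60, "Proficient"),
    (0.40, "Competent"), (0.00, "Novice"),
]

def performance_tier(composite: float) -> str:
    """Map a composite in [0, 1] to the first tier whose floor it meets."""
    for threshold, tier in PERFORMANCE_TIERS:
        if composite >= threshold:
            return tier
    return "Novice"

print(performance_tier(0.93))  # Elite
print(performance_tier(0.55))  # Competent
```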

-- Individual dimension scores per assessment
CREATE TABLE dimension_scores (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    assessment_id   UUID NOT NULL REFERENCES assessments(id),
    axis            TEXT NOT NULL CHECK (axis IN ('performance', 'capability')),
    dimension       TEXT NOT NULL,
    score           NUMERIC(5,4) NOT NULL CHECK (score >= 0 AND score <= 1),
    sample_size     INT NOT NULL,
    evidence_count  INT NOT NULL,
    details         JSONB NOT NULL DEFAULT '{}',
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (assessment_id, axis, dimension)
);

CREATE INDEX idx_dimension_scores_assessment ON dimension_scores(assessment_id);
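
Rows in `dimension_scores` roll up into the composite columns on `assessments` using the weights from `dimension_weight_configs`. A plain weighted sum is one plausible aggregation, sketched here with hypothetical dimension names; the actual engine may apply recency or sample-size adjustments first:

```python
def composite_score(dimension_scores, weights):
    """Combine per-dimension scores into an axis composite.
    dimension_scores: {dimension: score in [0, 1]}
    weights: {dimension: weight}, expected to sum to 1.0."""
    return sum(weights[d] * s for d, s in dimension_scores.items())

# Hypothetical dimensions and weights, for illustration only.
result = composite_score({"accuracy": 0.9, "latency": 0.7},
                         {"accuracy": 0.6, "latency": 0.4})
print(round(result, 2))  # 0.82
```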

-- Badges
CREATE TABLE badges (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    assessment_id   UUID NOT NULL UNIQUE REFERENCES assessments(id),
    agent_id        UUID NOT NULL REFERENCES agents(id),
    svg_url         TEXT,
    png_url         TEXT,
    json_data       JSONB NOT NULL,
    issued_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    valid_until     TIMESTAMPTZ NOT NULL,
    revoked         BOOLEAN NOT NULL DEFAULT false,
    revoked_at      TIMESTAMPTZ,
    revoke_reason   TEXT
);

CREATE INDEX idx_badges_agent_id ON badges(agent_id);
CREATE INDEX idx_badges_valid_until ON badges(valid_until);
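
Because expired badges are kept rather than deleted (see 7.3), verification reduces to checking the `revoked` and `valid_until` columns. A minimal sketch of that check:

```python
from datetime import datetime, timezone

def badge_is_valid(revoked: bool, valid_until: datetime, now: datetime) -> bool:
    """Revocation always wins; an expired badge still exists but
    verifies as invalid."""
    return not revoked and now < valid_until

now = datetime(2026, 4, 16, tzinfo=timezone.utc)
print(badge_is_valid(False, datetime(2026, 7, 1, tzinfo=timezone.utc), now))  # True
print(badge_is_valid(True,  datetime(2026, 7, 1, tzinfo=timezone.utc), now))  # False
```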

-- Appeals
CREATE TABLE appeals (
    id                  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    assessment_id       UUID NOT NULL REFERENCES assessments(id),
    dimensions_challenged TEXT[] NOT NULL,
    justification       TEXT NOT NULL,
    counter_evidence    JSONB NOT NULL DEFAULT '[]',
    status              TEXT NOT NULL DEFAULT 'filed'
                        CHECK (status IN ('filed', 'under_review', 'accepted', 'rejected')),
    reviewer_notes      TEXT,
    resolution          JSONB,
    filed_at            TIMESTAMPTZ NOT NULL DEFAULT now(),
    resolved_at         TIMESTAMPTZ
);

CREATE INDEX idx_appeals_assessment ON appeals(assessment_id);

-- Webhooks
CREATE TABLE webhooks (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    organization_id UUID NOT NULL REFERENCES organizations(id),
    url             TEXT NOT NULL,
    event_types     TEXT[] NOT NULL,
    secret_hash     TEXT NOT NULL,
    active          BOOLEAN NOT NULL DEFAULT true,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Stress test definitions
CREATE TABLE stress_tests (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    organization_id UUID REFERENCES organizations(id),  -- NULL = system default
    dimension       TEXT NOT NULL,
    complexity_level INT,
    domain          TEXT,
    task_definition JSONB NOT NULL,
    expected_outcome JSONB,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Stress test results (linked to assessments)
CREATE TABLE stress_test_results (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    assessment_id   UUID NOT NULL REFERENCES assessments(id),
    stress_test_id  UUID NOT NULL REFERENCES stress_tests(id),
    agent_id        UUID NOT NULL REFERENCES agents(id),
    outcome         TEXT NOT NULL CHECK (outcome IN ('pass', 'partial', 'fail')),
    score           NUMERIC(5,4),
    details         JSONB NOT NULL DEFAULT '{}',
    executed_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_stress_test_results_assessment ON stress_test_results(assessment_id);
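
When the per-row `score` column is NULL, the pass/partial/fail outcomes still need to fold into a dimension score. One plausible aggregation, assuming a fixed value per outcome (the 0.5 weighting for 'partial' is an assumption, not specified above):

```python
# Outcome values are assumed; the spec only defines the three labels.
OUTCOME_VALUES = {"pass": 1.0, "partial": 0.5, "fail": 0.0}

def stress_dimension_score(outcomes):
    """Average the mapped outcome values; None when there is no evidence."""
    if not outcomes:
        return None
    return sum(OUTCOME_VALUES[o] for o in outcomes) / len(outcomes)

print(stress_dimension_score(["pass", "partial", "fail", "pass"]))  # 0.625
```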

7.3 Data Retention

| Data Type | Retention | Rationale |
|---|---|---|
| Events | 180 days | Evidence for 2 assessment cycles |
| Assessments | Indefinite | Historical record, trend analysis |
| Dimension Scores | Indefinite | Trend analysis per dimension |
| Badges | Indefinite (mark expired, don't delete) | Verification may happen after expiry |
| Appeals | Indefinite | Audit trail |
| Stress Test Results | 365 days | Calibration and comparison |
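
The two finite retention periods translate into periodic purge statements. A sketch of how those DELETEs could be generated (a real job would bind the cutoff as a query parameter and run on a scheduler, neither of which this spec defines):

```python
from datetime import datetime, timedelta, timezone

# Intervals come from the retention table above.
RETENTION_DAYS = {"events": 180, "stress_test_results": 365}

def purge_statements(now: datetime):
    """Yield one DELETE per finitely-retained table."""
    for table, days in RETENTION_DAYS.items():
        cutoff = now - timedelta(days=days)
        ts_col = "created_at" if table == "events" else "executed_at"
        # Real code should pass cutoff as a bound parameter, not inline SQL.
        yield f"DELETE FROM {table} WHERE {ts_col} < '{cutoff.isoformat()}'"

for stmt in purge_statements(datetime(2026, 4, 16, tzinfo=timezone.utc)):
    print(stmt)
```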

8. Phase Breakdown

Phase 1: Standalone Tool (Internal)

Goal: Assess all 30+ agents in the civilization. Produce badges. Identify capability gaps.

Scope:

Architecture:

assessment-agent/
  cli.py                  # CLI entry point
  config.py               # Weights, tiers, taxonomy
  collectors/
    task_log_collector.py  # Parse existing task logs
    memory_collector.py    # Parse agent memory files
    manual_collector.py    # Manual event entry
  scorers/
    performance.py         # Performance dimension scorers
    capability.py          # Capability dimension scorers
    composite.py           # Composite score + tier mapping
  stress_tests/
    runner.py              # Stress test orchestrator
    tasks/                 # Standard stress test task definitions
  badge/
    generator.py           # SVG + JSON badge generation
    templates/             # SVG templates
  reports/
    markdown.py            # Per-agent markdown report
    summary.py             # Civilization-wide summary
  db/
    schema.sql             # SQLite schema
    models.py              # Data access layer
  tests/

CLI Commands:

python cli.py enroll <agent_name> --domains security,code --tools bash,grep
python cli.py collect <agent_name> --source task_logs --from 2026-03-17
python cli.py assess <agent_name> --type full
python cli.py assess --all --type passive_only
python cli.py badge <assessment_id> --format svg,json
python cli.py report <agent_name> --output reports/
python cli.py report --summary --output reports/civilization-summary.md
python cli.py stress-test <agent_name> --dimension delegation
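
A hypothetical skeleton of how `cli.py` might parse the `enroll` invocation shown above using argparse subcommands; the real Phase 1 argument surface may differ:

```python
import argparse

# Skeleton only: mirrors the enroll usage above, nothing more.
parser = argparse.ArgumentParser(prog="cli.py")
sub = parser.add_subparsers(dest="command", required=True)
enroll = sub.add_parser("enroll")
enroll.add_argument("agent_name")
enroll.add_argument("--domains", type=lambda s: s.split(","), default=[])
enroll.add_argument("--tools", type=lambda s: s.split(","), default=[])

args = parser.parse_args(
    ["enroll", "security-auditor", "--domains", "security,code",
     "--tools", "bash,grep"]
)
print(args.agent_name, args.domains, args.tools)
```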

Deliverables:

Timeline: 4-6 weeks.

Phase 2: Platform Service (External)

Goal: Multi-tenant SaaS where any organization can assess their agents.

Scope (additive to Phase 1):

Architecture additions:

assessment-service/
  api/
    routes/
      organizations.py
      agents.py
      events.py
      assessments.py
      badges.py
      appeals.py
      webhooks.py
    middleware/
      auth.py             # OAuth 2.0 + API key
      rate_limit.py
      tenant_isolation.py
    schemas/              # Pydantic request/response models
  workers/
    assessment_worker.py  # Async assessment execution
    badge_worker.py       # Async badge generation
    webhook_worker.py     # Webhook delivery
  sdk/
    python/
    typescript/
  dashboard/              # Web UI

Additional Deliverables:

Timeline: 8-12 weeks after Phase 1 completion.

Phase 3: Ecosystem (Future)

Goal: Industry-standard assessment framework with third-party integrations.

Scope (conceptual, not designed yet):


Appendix A: Glossary

All terms follow ISO/IEC 22989:2022 definitions where applicable.

| Term | Definition | Source |
|---|---|---|
| Agent | An AI system that perceives its environment and takes actions to achieve goals. | ISO/IEC 22989 |
| Assessment | The process of evaluating an agent's performance and capability across defined dimensions. | AAAF |
| Badge | A visual and machine-readable credential summarizing an agent's assessment results. | AAAF |
| Capability | The set of functions an agent can perform, independent of execution quality. | AAAF |
| Complexity Ceiling | The maximum task complexity level at which an agent can deliver acceptable results. | AAAF |
| Delegation | The act of an agent routing a task to another agent. | AAAF |
| Dimension | A single measurable aspect of performance or capability. | AAAF |
| Human-in-the-loop | A human approves every action before execution. | ISO/IEC 22989 |
| Human-on-the-loop | A human monitors and can intervene at checkpoints. | ISO/IEC 22989 |
| Human-over-the-loop | A human sets policies; intervenes only on exceptions. | ISO/IEC 22989 |
| Orchestration | Coordination of multiple agents to complete a composite task. | AAAF |
| Performance | How well an agent executes tasks it is given. | AAAF |
| Stress Test | A controlled assessment task designed to probe a specific dimension. | AAAF |
| Tier | A categorical label mapped from a numeric score range. | AAAF |

Appendix B: Standards References


End of specification.