Contents

  1. Abstract
  2. The Problem
  3. The Crovia Stack
  4. Layer 0 — Observation
  5. Layer 1 — Cryptographic Evidence
  6. Layer 2 — Intelligence
  7. Layer 3 — Forensic Analysis
  8. Layer 4 — Behavioral Attribution
  9. Layer 5 — Settlement
  10. Operational Metrics
  11. Use Cases
  12. Competitive Analysis
  13. Business Model
  14. Roadmap

1. Abstract

AI models are trained on vast amounts of data. The creators of that data have no reliable way to know if their work was used, how it was used, or whether they are owed compensation. Regulators lack the infrastructure to audit compliance at scale. AI companies themselves have no verifiable way to prove their training practices.

Crovia is a six-layer evidence stack that addresses this gap — not by judging, but by building verifiable records at every stage of the problem: from continuous observation of what models disclose, through cryptographic proof of what they do not, to behavioral analysis of what their weights reveal.

Layers 0 through 3 are operational. Layer 4 is in advanced development. Layer 5 is planned.

Crovia records what exists and what is absent. Every score, every grade, every classification is derived from observable, verifiable signals — never from opinion. The system observes. It does not judge.

2. The Problem

AI training today happens without verifiable records. There is no standard infrastructure to:

The Regulatory Moment

The EU AI Act mandates training data transparency starting 2026. The US has proposed similar legislation. Japan, Canada, Australia, and others are drafting frameworks. But no infrastructure exists to verify compliance with any of them.

Challenge | Impact
No training receipts | $50B+ content at risk of unauthorized use
Unverifiable opt-outs | GDPR fines up to 4% of global revenue
No settlement mechanism | Creators receive $0 from AI training
Undetectable data usage | No forensic tools for regulators or rights holders

The gap is not political will. The gap is infrastructure. Crovia builds that infrastructure.

3. The Crovia Stack

Crovia is organized as a six-layer stack. Each layer produces verifiable artifacts that feed the layers above it. Lower layers are simpler and more mature. Higher layers are more powerful and build on the evidence produced below.

Layer 5: Settlement | CPCS Seal, CFIC Certificates, DPI Engine | [PLANNED]
Layer 4: Attribution | CDS v3, DPI Bridge, NEC#1 Integration | [IN PROGRESS]
Layer 3: Forensics | GDNA Fingerprinting, Model Sonar, Agent System | [LIVE]
Layer 2: Intelligence | Risk Index, Exposure Score, Org Grade, Gating | [LIVE]
Layer 1: Evidence | TPA, Merkle Trees, DDF Hashing, CEP Capsules | [LIVE]
Layer 0: Observation | Observer, Oracle, NEC# Canon, Compliance Mapping | [LIVE]

The critical insight: each layer is independently verifiable. A TPA proof is valid whether or not GDNA confirms it. A CDS score is meaningful whether or not a settlement follows. The layers compose, but no layer's artifacts need another layer in order to be verified.

4. Layer 0 — Observation

The foundation. An autonomous observer continuously monitors AI model documentation across thousands of targets, recording what each model discloses and what it does not.

Observer

Runs hourly. Examines model cards, README files, and associated documentation for 3,500+ models. Each observation is timestamped, hashed, and committed to an append-only ledger.
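The timestamp, hash, and append-only commit can be illustrated with a hash chain. The following is a minimal sketch, not Crovia's ledger format: the field names, the all-zero genesis hash, and the canonical-JSON encoding are assumptions chosen for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical marker for "no previous entry"


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def canonical(body: dict) -> bytes:
    # Sorted keys and fixed separators make the serialization reproducible.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def append_observation(ledger: list, target: str, doc_text: str, observed_at: str) -> dict:
    """Append one observation; each entry commits to the previous entry's hash."""
    body = {
        "target": target,
        "observed_at": observed_at,
        "doc_hash": sha256_hex(doc_text.encode("utf-8")),
        "prev_hash": ledger[-1]["entry_hash"] if ledger else GENESIS,
    }
    entry = dict(body, entry_hash=sha256_hex(canonical(body)))
    ledger.append(entry)
    return entry


def verify_ledger(ledger: list) -> bool:
    """Recompute every hash and check that each entry links to its predecessor."""
    prev_hash = GENESIS
    for entry in ledger:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev_hash"] != prev_hash or entry["entry_hash"] != sha256_hex(canonical(body)):
            return False
        prev_hash = entry["entry_hash"]
    return True


ledger = []
append_observation(ledger, "org/model-a", "# Model card\nNo dataset section.", "2026-03-01T10:00:00Z")
append_observation(ledger, "org/model-a", "# Model card\nNo dataset section.", "2026-03-01T11:00:00Z")
print(verify_ledger(ledger))
```

Because every entry hashes its predecessor, changing any past observation invalidates all later entries, which is the property an append-only ledger relies on.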

NEC# Canon

A structured framework of 20 Necessary Elements of Compliance — the disclosure requirements that any AI model should satisfy. Each NEC# maps to specific articles in the EU AI Act, GDPR, CCPA, and 8 other regulatory frameworks across 11 jurisdictions.

Compliance Mapping

For every observed model, automated regulatory gap analysis produces a per-element, per-jurisdiction compliance map. Over 6,800 models have compliance reports with gap identification.

Outreach Pipeline

When documentation gaps are identified, the system contacts model maintainers directly — via GitHub issues and HuggingFace discussions — documenting the disclosure state before and after contact.

5. Layer 1 — Cryptographic Evidence

Raw observations become cryptographic evidence: immutable, timestamped, verifiable by anyone.

Temporal Proof of Absence (TPA) LIVE

The signature innovation. A TPA is a cryptographic proof that training data disclosure was NOT found at a specific point in time. Each proof includes:

This creates a body of temporal evidence that no party can retroactively alter. If a model adds disclosure next month, the TPA proves it was absent last month.

Disclosure Data Fingerprint (DDF)

Every observation includes a hash of the model's documentation state. When documentation changes, especially after outreach contact, the DDF records the exact transition with cryptographic timestamps. This is the basis of the Impact Observatory: verifiable evidence that documentation improvements followed Crovia's outreach.
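As a rough illustration of how a documentation fingerprint can expose transitions, the sketch below hashes each snapshot and reports every point where the hash changes. The whitespace normalization and the names `ddf` and `detect_transitions` are assumptions for this example, not the production definition of a DDF.

```python
import hashlib


def ddf(doc_text: str) -> str:
    """Fingerprint a documentation snapshot (assumed normalization: trim trailing whitespace)."""
    normalized = "\n".join(line.rstrip() for line in doc_text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_transitions(snapshots):
    """snapshots: chronological list of (timestamp, doc_text) pairs."""
    transitions = []
    prev_ts, prev_fp = None, None
    for ts, text in snapshots:
        fp = ddf(text)
        if prev_fp is not None and fp != prev_fp:
            transitions.append(
                {"from_ts": prev_ts, "to_ts": ts, "from_ddf": prev_fp, "to_ddf": fp}
            )
        prev_ts, prev_fp = ts, fp
    return transitions


snapshots = [
    ("2026-02-01T00:00:00Z", "# Model card\n"),
    ("2026-02-15T00:00:00Z", "# Model card   \n"),  # whitespace-only edit
    ("2026-03-01T00:00:00Z", "# Model card\nTraining data: described below.\n"),
]
changes = detect_transitions(snapshots)
print(len(changes), changes[0]["from_ts"], changes[0]["to_ts"])
```

Comparing the transition timestamps with the outreach timestamp shows whether a change came before or after contact. It does not by itself establish why the change happened.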

Crovia Evidence Protocol (CEP) LIVE

Evidence capsules that package multiple proof types into a single, verifiable bundle. A CEP capsule can contain TPA proofs, DDF fingerprints, compliance maps, and forensic results — all anchored to a common Merkle root.
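Anchoring several proofs to a common Merkle root can be sketched as follows. This is a simplified construction under stated assumptions: single SHA-256 per node and duplication of the last node on odd levels. The actual CEP tree layout is not specified here and may differ.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _next_level(level):
    # Pair adjacent nodes; an odd level duplicates its last node first.
    if len(level) % 2 == 1:
        level = level + [level[-1]]
    return level, [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves):
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        _, level = _next_level(level)
    return level[0]


def merkle_proof(leaves, index):
    """Return the sibling path for leaves[index] as (hash, sibling_is_left) pairs."""
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        padded, parents = _next_level(level)
        sibling = index ^ 1
        proof.append((padded[sibling], sibling < index))
        level, index = parents, index // 2
    return proof


def verify_proof(leaf, proof, root):
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root


items = [b"tpa-proof", b"ddf-fingerprint", b"compliance-map"]  # placeholder capsule contents
root = merkle_root(items)
print(all(verify_proof(item, merkle_proof(items, i), root) for i, item in enumerate(items)))
```

A verifier who holds only the root and one item's sibling path can confirm that the item belongs to the capsule without seeing the other items.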

Cryptographic Primitive | Algorithm | Standard
Hashing | SHA-256, BLAKE3 | FIPS 180-4 (SHA-256 only)
Signatures | Ed25519 | RFC 8032
Timestamps | RFC 3161 | ITU-T X.509
Merkle Trees | SHA-256 | Bitcoin standard
Commitments | Pedersen (BN254) | zkSNARK-compatible

6. Layer 2 — Intelligence

Evidence becomes intelligence. Layer 2 aggregates signals from Layer 0 and Layer 1 into actionable metrics — not opinions, but quantified absence.

Dataset Risk Index LIVE

A per-model risk score aggregated from seven independent signals:

  1. TPA absence — no temporal proof exists
  2. Organization grade — systematic non-disclosure at org level
  3. Temporal pressure — increasing urgency patterns
  4. Gating status — access restrictions detected
  5. Download exposure — high usage amplifies risk
  6. Provenance gaps — missing lineage chain
  7. Outreach non-response — maintainer silence after contact

Each signal is observable and verifiable. The index does not say a model is bad — it says the evidence gap is large.
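One way to picture the aggregation is a normalized weighted sum over the seven signals. The sketch below assumes each signal has already been scaled to the range 0 to 1 and uses equal weights by default. Both the scaling and the weights are illustrative assumptions, not the published scoring rule.

```python
SIGNALS = (
    "tpa_absence",
    "org_grade",
    "temporal_pressure",
    "gating_status",
    "download_exposure",
    "provenance_gaps",
    "outreach_non_response",
)


def risk_index(signals, weights=None):
    """Weighted mean of the seven signals; returns a value between 0 and 1."""
    weights = weights or {name: 1.0 for name in SIGNALS}
    missing = [name for name in SIGNALS if name not in signals]
    if missing:
        raise KeyError(f"missing signals: {missing}")
    for name in SIGNALS:
        if not 0.0 <= signals[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    total_weight = sum(weights[name] for name in SIGNALS)
    return sum(weights[name] * signals[name] for name in SIGNALS) / total_weight


example = {name: 0.0 for name in SIGNALS}
example["tpa_absence"] = 1.0  # only the TPA signal fires
print(round(risk_index(example), 3))
```

Keeping each signal separately inspectable means a reader can always trace a high index back to the specific gaps that produced it.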

Exposure Score LIVE

Combines download volume with risk tier. A high-risk model with 10 million downloads represents greater exposure than the same risk at 100 downloads. This drives prioritization for outreach, forensic analysis, and regulatory attention.
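A simple way to combine the two inputs is to multiply a per-tier weight by a logarithm of the download count, so that exposure grows with usage without being dominated by a few very popular models. The tier names, tier weights, and the logarithmic form below are all hypothetical choices for illustration.

```python
import math

TIER_WEIGHT = {"low": 1.0, "medium": 2.0, "high": 3.0}  # hypothetical tiers and weights


def exposure_score(risk_tier: str, downloads: int) -> float:
    """Exposure rises with both the risk tier and the order of magnitude of downloads."""
    if downloads < 0:
        raise ValueError("downloads must be non-negative")
    return TIER_WEIGHT[risk_tier] * math.log10(1 + downloads)


popular = exposure_score("high", 10_000_000)
obscure = exposure_score("high", 100)
print(popular > obscure)
```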

Organization Transparency Score LIVE

Per-organization grading (A through F) based on the aggregate transparency of all models under that organization. Factors include TPA coverage, compliance score, response rate to outreach, and severity distribution. Over 2,300 organizations are graded.

Response Classifier LIVE

Classifies outreach responses into categories: engaged, acknowledged, deflected, non-responsive, improved. Cross-references GitHub issue states, gating events, TPA additions, and documentation changes to determine the actual organizational response — not just the words.

Gating Detector LIVE

Monitors access restrictions on models. When a model becomes gated after outreach contact, this is recorded as a potential indicator of documentation avoidance. Temporal evidence establishes whether gating correlated with Crovia observation.

7. Layer 3 — Forensic Analysis

Layers 0–2 examine what models say. Layer 3 examines what models are. This is where Crovia moves from documentation analysis to behavioral and structural forensics.

Geometric DNA (GDNA) LIVE

Weight-level statistical fingerprinting. The GDNA system downloads model weights — shard by shard, streaming — and extracts per-layer statistical signatures without storing the full weights. The resulting fingerprint enables:

An autonomous agent system manages the download queue, rate limiting, and storage box archival for 10 model families across 5 organizations. Processing runs continuously within configured schedule windows.
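Extracting per-layer statistics without holding the full weights in memory can be done with a one-pass algorithm such as Welford's running mean and variance. The sketch below illustrates that idea on plain Python lists. The choice of statistics (count, mean, standard deviation, min, max) and the shard representation are assumptions, since the actual GDNA signature is not specified here.

```python
import math


class RunningStats:
    """Welford's online algorithm: one pass, constant memory per layer."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    def signature(self) -> dict:
        std = math.sqrt(self.m2 / self.n) if self.n else 0.0  # population std
        return {"count": self.n, "mean": self.mean, "std": std,
                "min": self.min, "max": self.max}


def fingerprint(shards):
    """shards: iterable of dicts mapping layer name -> iterable of float weights."""
    stats = {}
    for shard in shards:  # each shard can be discarded after this loop body
        for layer, values in shard.items():
            running = stats.setdefault(layer, RunningStats())
            for value in values:
                running.update(value)
    return {layer: running.signature() for layer, running in sorted(stats.items())}


shards = [{"layer0": [1.0, 2.0]}, {"layer0": [3.0, 4.0], "layer1": [0.0]}]
fp = fingerprint(shards)
print(fp["layer0"]["count"], fp["layer0"]["mean"])
```

Only the running aggregates survive between shards, so memory use depends on the number of layers rather than the number of parameters.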

Model Sonar — Latent Comparison (LC) LIVE

Behavioral fingerprinting across five knowledge domains: medical, legal, code, scientific, and news. For each domain, curated probe pairs measure whether a model exhibits domain-specific knowledge that its documentation does not account for.

Sonar produces per-domain decisions (PRESENT, NEUTRAL, ABSENT) with z-scores calibrated against a global baseline. If a model shows strong medical knowledge but declares no medical training data, the signal is evidence — not proof, but forensically meaningful evidence.
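The mapping from a z-score to a per-domain decision might look like the sketch below. It assumes a symmetric cutoff of 2.0 standard deviations and a baseline given as a list of scores from other models. Both are illustrative assumptions, and the real calibration may differ.

```python
import statistics


def sonar_decision(model_score, baseline_scores, threshold=2.0):
    """Return (label, z) where z measures distance from the baseline in standard deviations."""
    mu = statistics.mean(baseline_scores)
    sigma = statistics.stdev(baseline_scores)
    if sigma == 0:
        raise ValueError("baseline has no variance; z-score is undefined")
    z = (model_score - mu) / sigma
    if z >= threshold:
        label = "PRESENT"
    elif z <= -threshold:
        label = "ABSENT"
    else:
        label = "NEUTRAL"
    return label, z


baseline = [0.48, 0.50, 0.52, 0.49, 0.51]  # hypothetical probe scores from baseline models
for score in (0.60, 0.50, 0.40):
    print(score, sonar_decision(score, baseline)[0])
```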

Forensic Correlator LIVE

Cross-references signals from GDNA, Sonar, TPA, compliance maps, and outreach responses to identify patterns. Convergent signals — structural ancestry and behavioral advantage and documentation absence — produce stronger evidence than any single signal alone.

8. Layer 4 — Behavioral Attribution (CDS)

Layer 4 takes on the question Layer 3 cannot: was this specific dataset used to train this specific model? Comparative Domain Sensitivity (CDS) addresses it with statistical evidence of dataset-specific advantage.

CDS v3 Engine IN PROGRESS

CDS measures a model's behavioral advantage on a dataset compared to a reference pool of models. The key insight: if model M performs disproportionately better on dataset D than reference models do — after controlling for model capacity — this is evidence that M was trained on D.

The method:

  1. BPC computation — Bits-per-character loss on the target dataset, normalized to eliminate tokenizer confounds
  2. Reference pool comparison — Median BPC across a pool of reference models establishes the baseline
  3. Capacity calibration — The same measurement on neutral text removes the effect of model size (a larger model is better at everything, not just the target dataset)
  4. CDS v3 score: CDS_v3 = CDS_raw − CDS_neutral
  5. Bootstrap CI — 1,000-iteration bootstrap produces a 95% confidence interval

A positive CDS v3 score means the model shows dataset-specific advantage beyond what capacity alone explains. The score is accompanied by a signal label: STRONG_POSITIVE, WEAK_POSITIVE, NEUTRAL, WEAK_NEGATIVE, or STRONG_NEGATIVE.
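The five steps can be sketched end to end in a few functions. This is an illustrative reading of the method, not the CDS v3 engine itself: it assumes CDS_raw and CDS_neutral are each the mean, over text samples, of the reference-pool median BPC minus the model's BPC, and that the bootstrap resamples text samples. The input numbers are made up.

```python
import math
import random
import statistics


def bpc(total_nll_nats, num_chars):
    """Bits per character from a total negative log-likelihood given in nats."""
    return total_nll_nats / (math.log(2) * num_chars)


def advantage(model_bpc, reference_bpcs):
    """Reference-pool median BPC minus model BPC (positive = model does better)."""
    return statistics.median(reference_bpcs) - model_bpc


def cds_v3(target, neutral):
    """Each argument is a list of (model_bpc, [reference_bpcs]) pairs, one per text sample."""
    cds_raw = statistics.mean([advantage(m, refs) for m, refs in target])
    cds_neutral = statistics.mean([advantage(m, refs) for m, refs in neutral])
    return cds_raw - cds_neutral


def bootstrap_ci(target, neutral, iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample text samples with replacement and rescore."""
    rng = random.Random(seed)
    scores = sorted(
        cds_v3(rng.choices(target, k=len(target)), rng.choices(neutral, k=len(neutral)))
        for _ in range(iterations)
    )
    lower = scores[int(alpha / 2 * iterations)]
    upper = scores[min(iterations - 1, int((1 - alpha / 2) * iterations))]
    return lower, upper


refs = [1.00, 1.10, 1.20]  # made-up reference-pool BPC values
target = [(0.80, refs), (0.85, refs), (0.90, refs), (0.82, refs)]   # large advantage on dataset D
neutral = [(1.05, refs), (1.08, refs), (1.02, refs), (1.10, refs)]  # small advantage on neutral text
score = cds_v3(target, neutral)
low, high = bootstrap_ci(target, neutral)
print(round(score, 3), low > 0)
```

Subtracting the neutral-text advantage is what separates "this model is simply stronger" from "this model is unusually strong on this dataset". A confidence interval that excludes zero indicates the difference is unlikely to be a sampling artifact.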

CDS Artifacts

Every CDS computation produces a verifiable artifact: a self-contained evidence object with SHA-256 commitment hash, BPC vector hashes for reproducibility, optional Pedersen commitment for ZK compatibility, and an anchor digest for external timestamping (OpenTimestamps, RFC 3161).

Artifacts are compatible with the Evidence Envelope V1 schema, the ForwardChain anchoring protocol, and CRC-1 capsule packaging.

GDNA + CDS: Converging Evidence

When both GDNA (structural ancestry) and CDS (behavioral advantage) point to the same conclusion, the combined evidence is substantially stronger than either alone:

GDNA | CDS | Combined Verdict | Strength
Positive | Positive | Converging Evidence | Strong
Positive | Negative | Structural Only | Moderate
Negative | Positive | Behavioral Only | Moderate
Negative | Negative | No Evidence | None
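The combination rule is a four-entry lookup. The sketch below encodes the table directly. Reducing each signal to a boolean is an assumption made for brevity.

```python
VERDICTS = {
    (True, True): ("Converging Evidence", "Strong"),
    (True, False): ("Structural Only", "Moderate"),
    (False, True): ("Behavioral Only", "Moderate"),
    (False, False): ("No Evidence", "None"),
}


def combined_verdict(gdna_positive: bool, cds_positive: bool):
    """Map the two signal outcomes to (verdict, strength)."""
    return VERDICTS[(bool(gdna_positive), bool(cds_positive))]


print(combined_verdict(True, True))
```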

NEC#1 Integration

CDS scores feed directly into NEC#1 (data provenance) compliance assessment. A strong CDS signal on an undisclosed dataset increases the NEC#1 severity modifier — the compliance gap becomes not just "they didn't say" but "we have behavioral evidence they should have said."

9. Layer 5 — Settlement

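The final layer: evidence becomes compensation. The settlement components are planned and not yet operational; the DPI Bridge, which connects Layer 4 attribution to settlement, is in progress.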

DPI Bridge IN PROGRESS

The Data Provider Impact engine converts CDS detection into provider compensation. When CDS detects behavioral advantage on dataset D, the provider of D receives a coverage boost in the settlement estimation. The feedback loop:

CDS detects advantage on dataset D → Provider D gets coverage_boost → Settlement weight increases → Provider D receives higher payout → CFIC certificate includes CDS artifact hash as evidence

The bridge is conservative by design: only strong CDS signals produce boosts, boosts are capped to prevent gaming, and every boost carries an evidence hash traceable to the original CDS artifact.
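The three safeguards (strong signals only, capped boosts, traceable evidence hash) can be sketched in one function. The boost size, the cap, and the artifact fields below are hypothetical values chosen for the example, not parameters of the DPI Bridge.

```python
import hashlib
import json

BOOST_BY_LABEL = {"STRONG_POSITIVE": 0.10}  # hypothetical: only strong signals earn a boost
MAX_TOTAL_BOOST = 0.25                      # hypothetical cap per provider


def coverage_boost(artifact: dict, current_boost: float = 0.0) -> dict:
    """Turn one CDS artifact into a capped boost that carries its evidence hash."""
    canonical = json.dumps(artifact, sort_keys=True, separators=(",", ":")).encode("utf-8")
    increment = BOOST_BY_LABEL.get(artifact["signal"], 0.0)
    return {
        "provider": artifact["provider"],
        "coverage_boost": min(current_boost + increment, MAX_TOTAL_BOOST),
        "evidence_hash": hashlib.sha256(canonical).hexdigest(),
    }


strong = {"provider": "provider-d", "dataset": "D", "signal": "STRONG_POSITIVE"}
weak = {"provider": "provider-d", "dataset": "D", "signal": "WEAK_POSITIVE"}
print(coverage_boost(strong)["coverage_boost"], coverage_boost(weak)["coverage_boost"])
```

Because the hash is computed over the whole artifact, any later dispute about a boost can be checked against the exact evidence object that triggered it.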

CPCS Seal PLANNED

The Crovia Production Commit Seal: a Merkle-anchored payout record with an immutable audit trail. Every settlement is verifiable, every payout is traceable.

CFIC Certificates PLANNED

Crovia Fair Impact Certificates: cryptographic receipts that prove a data provider received compensation, with embedded references to the evidence (CDS artifacts, TPA proofs, compliance gaps) that justified the payout.

10. Operational Metrics

Current system state as of March 2026:

Component | Status | Volume
Observer | LIVE | 3,500+ unique targets, 500 observations/day
TPA Chain | LIVE | 6,800+ models, chain height 12,500+
Compliance Reports | LIVE | 6,800+ models × 20 NEC# × 11 jurisdictions
Outreach Pipeline | LIVE | 180+ issues classified
Org Transparency | LIVE | 2,300+ organizations graded A–F
Dataset Risk Index | LIVE | 5,900+ models scored across 7 signals
Provenance Graph | LIVE | 10,500+ nodes, 25,900+ edges
Model Sonar | LIVE | 16+ models × 5 domains
GDNA Agent | LIVE | 10 model families, streaming shard analysis
CDS v3 Engine | IN PROGRESS | Batch scanner operational

11. Use Cases

For Regulators

Problem: No infrastructure to audit AI training compliance at scale.

Crovia provides: Immutable observation ledger with temporal proofs. Compliance mapping across 11 jurisdictions. Behavioral forensics (GDNA + Sonar + CDS) that go beyond what companies self-report. Risk index that prioritizes the largest evidence gaps.

For Content Creators & Rights Holders

Problem: No way to know if your data was used, and no mechanism to be compensated.

Crovia provides: CDS can detect behavioral evidence of training on specific datasets. The DPI Bridge converts detection into provider impact weight. Exposure scoring prioritizes high-download models where the impact is largest. Every step is backed by cryptographic evidence suitable for legal proceedings.

For AI Companies

Problem: No verifiable way to prove training data compliance.

Crovia provides: If your model genuinely complies, the evidence shows it. TPA records demonstrate disclosure. Compliance mapping confirms regulatory coverage. CDS analysis that shows NO_DATASET_SIGNAL is exculpatory evidence. Transparency is a competitive advantage when evidence exists to back it.

12. Competitive Analysis

Capability | Crovia | C2PA | Spawning | Fairly Trained
Temporal Absence Proofs | Live | No | No | No
Disclosure Monitoring | Live | No | No | No
Regulatory Gap Analysis | Live | No | No | No
Behavioral Fingerprinting | Live | No | No | No
Weight-Level Forensics | Live | No | No | No
Dataset Attribution | In Progress | No | No | No
Risk Intelligence | Live | No | No | No
Content Provenance | No | Yes | No | No
Opt-out Registry | No | No | Yes | No
Training Certification | No | No | No | Yes
Settlement Engine | Planned | No | No | No

Crovia addresses a different problem space than C2PA (content provenance), Spawning (opt-out), or Fairly Trained (ethical certification). These approaches are complementary. Of the systems compared here, Crovia is the only one that combines observation, cryptographic evidence, behavioral forensics, and dataset attribution in a single stack.

13. Business Model

Tier | Status | Includes
Open | Available | Public registry, observation ledger, TPA data, compliance reports, risk index, org scores, outreach status, public API
Pro | In Development | CDS v3 scanner, GDNA analysis, Model Sonar reports, full NEC# canon (20/20), forensic correlator, CEP capsule builder
Enterprise | Planned | On-premise deployment, custom model family monitoring, DPI integration, settlement engine, dedicated support

14. Roadmap

Phase | Timeline | Status | Milestones
Foundation | Q4 2025 – Q1 2026 | Complete | Observer, TPA, NEC# canon, public registry, outreach pipeline, compliance mapping
Evidence & Intelligence | Q1 2026 | Complete | Risk Index, Exposure Score, Org Grade, Response Classifier, Gating Detector, CEP capsules, provenance graph
Forensics | Q1 – Q2 2026 | Live | GDNA fingerprinting, Model Sonar, Agent system, Forensic Correlator
Attribution | Q2 2026 | In Progress | CDS v3 engine, DPI Bridge, GDNA+CDS converging evidence, NEC#1 integration
Settlement | Q3 – Q4 2026 | Planned | CPCS seals, CFIC certificates, royalty distribution, enterprise integrations