Crovia

Evidence Infrastructure for AI Training Transparency

v2.0 · March 2026 · CroviaTrust · [email protected]

Abstract
The Problem
The Crovia Stack
Layer 0 — Observation
Layer 1 — Cryptographic Evidence
Layer 2 — Intelligence
Layer 3 — Forensic Analysis
Layer 4 — Behavioral Attribution
Layer 5 — Settlement
Operational Metrics
Use Cases
Competitive Analysis
Business Model
Roadmap

1. Abstract

AI models are trained on vast amounts of data. The creators of that data have no reliable way to know if their work was used, how it was used, or whether they are owed compensation. Regulators lack the infrastructure to audit compliance at scale. AI companies themselves have no verifiable way to prove their training practices.

Crovia is a six-layer evidence stack that addresses this gap — not by judging, but by building verifiable records at every stage of the problem: from continuous observation of what models disclose, through cryptographic proof of what they do not, to behavioral analysis of what their weights reveal.

Layers 0 through 3 are operational. Layer 4 is in advanced development. Layer 5 is planned.

Crovia records what exists and what is absent. Every score, every grade, every classification is derived from observable, verifiable signals — never from opinion. The system observes. It does not judge.

2. The Problem

AI training today happens without verifiable records. There is no standard infrastructure to:

Prove which data was used to train a model
Verify that opt-out requests were honored
Detect undisclosed training data from behavioral evidence
Distribute royalties fairly to data contributors
Audit training compliance across jurisdictions at scale

The Regulatory Moment

The EU AI Act mandates training data transparency starting 2026. The US has proposed similar legislation. Japan, Canada, Australia, and others are drafting frameworks. But no infrastructure exists to verify compliance with any of them.

Challenge	Impact
No training receipts	$50B+ content at risk of unauthorized use
Unverifiable opt-outs	GDPR fines up to 4% of global annual turnover
No settlement mechanism	Creators receive $0 from AI training
Undetectable data usage	No forensic tools for regulators or rights holders

The gap is not political will. The gap is infrastructure. Crovia builds that infrastructure.

3. The Crovia Stack

Crovia is organized as a six-layer stack. Each layer produces verifiable artifacts that feed the layers above it. Lower layers are simpler and more mature. Higher layers are more powerful and build on the evidence produced below.

Layer	Name	Components	Status
Layer 5	Settlement	CPCS Seal, CFIC Certificates, DPI Engine	PLANNED
Layer 4	Attribution	CDS v3, DPI Bridge, NEC#1 Integration	IN PROGRESS
Layer 3	Forensics	GDNA Fingerprinting, Model Sonar, Agent System	LIVE
Layer 2	Intelligence	Risk Index, Exposure Score, Org Grade, Gating	LIVE
Layer 1	Evidence	TPA, Merkle Trees, DDF Hashing, CEP Capsules	LIVE
Layer 0	Observation	Observer, Oracle, NEC# Canon, Compliance Mapping	LIVE

The critical insight: each layer is independently verifiable. A TPA proof is valid whether or not GDNA confirms it. A CDS score is meaningful whether or not a settlement follows. Higher layers consume the evidence produced below them, but no layer's validity depends on another.

4. Layer 0 — Observation

The foundation. An autonomous observer continuously monitors AI model documentation across thousands of targets, recording what each model discloses and what it does not.

Observer

Runs hourly. Examines model cards, README files, and associated documentation for 3,500+ models. Each observation is timestamped, hashed, and committed to an append-only ledger.
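A minimal sketch of how a timestamped, hashed observation might be committed to an append-only ledger by linking each entry to the hash of the previous one. The field names, the canonical JSON encoding, and the all-zero genesis hash are assumptions made for illustration; the source does not specify Crovia's ledger format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def _canonical(body: dict) -> bytes:
    # Sorted keys and fixed separators make the serialization reproducible.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def append_observation(ledger: list, target: str, doc_text: str, observed_at: str) -> dict:
    """Append one observation, linked to the hash of the previous entry."""
    body = {
        "target": target,
        "observed_at": observed_at,  # ISO 8601 UTC timestamp
        "doc_hash": sha256_hex(doc_text.encode("utf-8")),
        "prev_hash": ledger[-1]["entry_hash"] if ledger else GENESIS,
    }
    entry = dict(body, entry_hash=sha256_hex(_canonical(body)))
    ledger.append(entry)
    return entry


def verify_ledger(ledger: list) -> bool:
    """Recompute every hash and link; any edit to a past entry breaks the chain."""
    prev = GENESIS
    for entry in ledger:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev_hash"] != prev or entry["entry_hash"] != sha256_hex(_canonical(body)):
            return False
        prev = entry["entry_hash"]
    return True
```

Because each entry commits to its predecessor, rewriting an old observation would require recomputing every later entry, which is what makes the ledger append-only in practice once later hashes have been published.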

NEC# Canon

A structured framework of 20 Necessary Elements of Compliance — the disclosure requirements that any AI model should satisfy. Each NEC# maps to specific articles in the EU AI Act, GDPR, CCPA, and 8 other regulatory frameworks across 11 jurisdictions.

Compliance Mapping

For every observed model, automated regulatory gap analysis produces a per-element, per-jurisdiction compliance map. Over 6,800 models have compliance reports with gap identification.

Outreach Pipeline

When documentation gaps are identified, the system contacts model maintainers directly — via GitHub issues and HuggingFace discussions — documenting the disclosure state before and after contact.

5. Layer 1 — Cryptographic Evidence

Raw observations become cryptographic evidence: immutable, timestamped, verifiable by anyone.

Temporal Proof of Absence (TPA) LIVE

The signature innovation. A TPA is a cryptographic proof that training data disclosure was NOT found at a specific point in time. Each proof includes:

Pedersen commitment — ZK-compatible binding to the observation
Merkle anchor — position in the global evidence tree
Ed25519 signature — non-repudiable issuer attestation
SHA-256 chain link — temporal ordering within the proof chain

This creates a body of temporal evidence that no party can retroactively alter. If a model adds disclosure next month, the TPA proves it was absent last month.
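The Merkle anchor component can be illustrated with a short sketch of a SHA-256 Merkle tree with inclusion proofs. This is not Crovia's implementation: the leaf and node prefix bytes are an assumption added to keep leaves and internal nodes distinct, and duplicating the last node on odd levels follows the Bitcoin convention.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _leaf_level(leaves: list) -> list:
    if not leaves:
        raise ValueError("at least one leaf is required")
    return [_h(b"\x00" + leaf) for leaf in leaves]  # 0x00 prefix marks a leaf


def _next_level(level: list) -> list:
    # 0x01 prefix marks an internal node; level length is even at this point.
    return [_h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves: list) -> bytes:
    level = _leaf_level(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = _next_level(level)
    return level[0]


def merkle_proof(leaves: list, index: int) -> list:
    """Return the sibling hashes needed to rebuild the root from leaf `index`."""
    level = _leaf_level(leaves)
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))  # (hash, sibling_is_left)
        level = _next_level(level)
        index //= 2
    return proof


def verify_proof(leaf: bytes, proof: list, root: bytes) -> bool:
    node = _h(b"\x00" + leaf)
    for sibling, sibling_is_left in proof:
        node = _h(b"\x01" + sibling + node) if sibling_is_left else _h(b"\x01" + node + sibling)
    return node == root
```

A verifier holding only the published root and a proof of logarithmic length can confirm that a given proof record is in the tree, without access to the other records.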

Disclosure Data Fingerprint (DDF)

Every observation includes a hash of the model's documentation state. When documentation changes, especially after outreach contact, the DDF records the exact transition with cryptographic timestamps. This is the basis of the Impact Observatory: verifiable evidence of documentation improvements that followed Crovia's outreach.
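As an illustration, documentation fingerprints and their transitions could be computed along the following lines. The whitespace normalization, the `Snapshot` type, and the conservative rule for flagging a change as post-contact are assumptions; the source only states that the documentation state is hashed and that transitions are recorded.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    observed_at: str  # ISO 8601 UTC, fixed format, so string order equals time order
    doc_text: str


def fingerprint(doc_text: str) -> str:
    # Assumed normalization: ignore trailing whitespace so cosmetic edits do not count.
    normalized = "\n".join(line.rstrip() for line in doc_text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def transitions(snapshots: list, contacted_at: str) -> list:
    """List every change of fingerprint between consecutive snapshots."""
    ordered = sorted(snapshots, key=lambda s: s.observed_at)
    changes = []
    for before, after in zip(ordered, ordered[1:]):
        fp_before, fp_after = fingerprint(before.doc_text), fingerprint(after.doc_text)
        if fp_before != fp_after:
            changes.append({
                "from": fp_before,
                "to": fp_after,
                "changed_between": (before.observed_at, after.observed_at),
                # Only flag the change as post-contact when the last unchanged
                # snapshot was itself taken at or after the contact time.
                "after_contact": before.observed_at >= contacted_at,
            })
    return changes
```

The sketch records ordering only. A change that follows contact is temporal evidence, and whether the contact prompted it is a separate question.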

Crovia Evidence Protocol (CEP) LIVE

Evidence capsules that package multiple proof types into a single, verifiable bundle. A CEP capsule can contain TPA proofs, DDF fingerprints, compliance maps, and forensic results — all anchored to a common Merkle root.

Cryptographic Primitive	Algorithm	Standard
Hashing	SHA-256, BLAKE3	FIPS 180-4 (SHA-256 only)
Signatures	Ed25519	RFC 8032
Timestamps	RFC 3161	ITU-T X.509
Merkle Trees	SHA-256	Bitcoin standard
Commitments	Pedersen (BN254)	zkSNARK-compatible

6. Layer 2 — Intelligence

Evidence becomes intelligence. Layer 2 aggregates signals from Layer 0 and Layer 1 into actionable metrics — not opinions, but quantified absence.

Dataset Risk Index LIVE

A per-model risk score aggregated from seven independent signals:

TPA absence — no temporal proof exists
Organization grade — systematic non-disclosure at org level
Temporal pressure — increasing urgency patterns
Gating status — access restrictions detected
Download exposure — high usage amplifies risk
Provenance gaps — missing lineage chain
Outreach non-response — maintainer silence after contact

Each signal is observable and verifiable. The index does not say a model is bad — it says the evidence gap is large.
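One way such an index could be aggregated is a weighted mean of the seven signals, each normalized to the range 0 to 1. The signal keys, the normalization, and the default equal weights are hypothetical; the source does not publish the actual weighting.

```python
from typing import Optional

# Hypothetical keys for the seven signals listed above.
SIGNALS = (
    "tpa_absence",
    "org_grade",
    "temporal_pressure",
    "gating_status",
    "download_exposure",
    "provenance_gaps",
    "outreach_non_response",
)


def risk_index(signals: dict, weights: Optional[dict] = None) -> float:
    """Weighted mean of the seven signals; 0.0 = no evidence gap, 1.0 = maximal gap."""
    weights = weights or {name: 1.0 for name in SIGNALS}  # assumed equal weighting
    for name in SIGNALS:
        if name not in signals:
            raise ValueError(f"missing signal: {name}")
        if not 0.0 <= signals[name] <= 1.0:
            raise ValueError(f"signal out of range: {name}")
    total_weight = sum(weights[name] for name in SIGNALS)
    return sum(weights[name] * signals[name] for name in SIGNALS) / total_weight
```

Keeping the per-signal inputs alongside the aggregate lets a reader check which observable gaps drive a given score.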

Exposure Score LIVE

Combines download volume with risk tier. A high-risk model with 10 million downloads represents greater exposure than the same risk at 100 downloads. This drives prioritization for outreach, forensic analysis, and regulatory attention.
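A hedged sketch of one possible combination: a risk-tier weight multiplied by the logarithm of the download count, so that exposure grows with usage without letting the largest models dominate completely. The tier names, the weights, and the log scaling are assumptions, not Crovia's published formula.

```python
import math

# Hypothetical tier weights; the source does not specify them.
TIER_WEIGHT = {"low": 1.0, "medium": 2.0, "high": 3.0}


def exposure_score(risk_tier: str, downloads: int) -> float:
    """Higher risk tier and higher download volume both increase exposure."""
    if downloads < 0:
        raise ValueError("downloads must be non-negative")
    return TIER_WEIGHT[risk_tier] * math.log10(1 + downloads)
```

Under these assumptions, a high-risk model with 10 million downloads scores well above the same tier at 100 downloads, matching the ordering described above.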

Organization Transparency Score LIVE

Per-organization grading (A through F) based on the aggregate transparency of all models under that organization. Factors include TPA coverage, compliance score, response rate to outreach, and severity distribution. Over 2,300 organizations are graded.

Response Classifier LIVE

Classifies outreach responses into categories: engaged, acknowledged, deflected, non-responsive, improved. Cross-references GitHub issue states, gating events, TPA additions, and documentation changes to determine the actual organizational response — not just the words.

Gating Detector LIVE

Monitors access restrictions on models. When a model becomes gated after outreach contact, this is recorded as a potential indicator of documentation avoidance. Temporal evidence establishes whether gating correlated with Crovia observation.

7. Layer 3 — Forensic Analysis

Layers 0–2 examine what models say. Layer 3 examines what models are. This is where Crovia moves from documentation analysis to behavioral and structural forensics.

Geometric DNA (GDNA) LIVE

Weight-level statistical fingerprinting. The GDNA system downloads model weights — shard by shard, streaming — and extracts per-layer statistical signatures without storing the full weights. The resulting fingerprint enables:

Model-to-model provenance — Mantel correlation detects structural ancestry
Fine-tune detection — statistical drift between base and derivative models
Undisclosed lineage — when a model's weights reveal a parent it does not credit
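The Mantel comparison can be sketched as follows, assuming each model is summarized by one small vector of statistics per layer and that both models have the same number of layers. How GDNA actually builds its signatures and distance matrices is not described in the source, so `layer_distance_matrix` and the choice of Euclidean distance are illustrative. The test itself is the standard Mantel procedure: correlate the upper triangles of two distance matrices, then estimate significance by permuting the rows and columns of one of them.

```python
import math
import random


def layer_distance_matrix(signatures: list) -> list:
    """Pairwise Euclidean distances between per-layer signature vectors."""
    return [[math.dist(a, b) for b in signatures] for a in signatures]


def _upper_triangle(matrix: list, order: list) -> list:
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def _pearson(xs: list, ys: list) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("distance matrix has no variation")
    return cov / math.sqrt(vx * vy)


def mantel(a: list, b: list, permutations: int = 999, seed: int = 0) -> tuple:
    """Return (correlation, one-sided permutation p-value) for two distance matrices."""
    identity = list(range(len(a)))
    flat_b = _upper_triangle(b, identity)
    observed = _pearson(_upper_triangle(a, identity), flat_b)
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)  # permute rows and columns of `a` together
        if _pearson(_upper_triangle(a, perm), flat_b) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)
```

A correlation near 1 with a small p-value would indicate that the two models share the same layer-to-layer structure, which is the kind of signal the text associates with structural ancestry. Real models have far more layers than a toy example, which is what gives the permutation test its power.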

An autonomous agent system manages the download queue, rate limiting, and storage box archival for 10 model families across 5 organizations. Processing runs continuously within configured schedule windows.

Model Sonar — Latent Comparison (LC) LIVE

Behavioral fingerprinting across five knowledge domains: medical, legal, code, scientific, and news. For each domain, curated probe pairs measure whether a model exhibits domain-specific knowledge that its documentation does not account for.

Sonar produces per-domain decisions (PRESENT, NEUTRAL, ABSENT) with z-scores calibrated against a global baseline. If a model shows strong medical knowledge but declares no medical training data, the signal is evidence — not proof, but forensically meaningful evidence.
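A minimal sketch of the per-domain decision, assuming each model yields a single probe score per domain and the baseline is the set of scores from other models in that domain. The threshold of two standard deviations is a hypothetical choice; the source does not state the calibration.

```python
import statistics


def sonar_decision(model_score: float, baseline_scores: list, threshold: float = 2.0) -> tuple:
    """Return (z_score, label) for one domain, relative to a baseline population."""
    mu = statistics.mean(baseline_scores)
    sigma = statistics.stdev(baseline_scores)  # sample standard deviation
    if sigma == 0:
        raise ValueError("baseline has no variation")
    z = (model_score - mu) / sigma
    if z >= threshold:
        label = "PRESENT"
    elif z <= -threshold:
        label = "ABSENT"
    else:
        label = "NEUTRAL"
    return z, label
```

A PRESENT decision in a domain the documentation does not mention is the kind of mismatch the text treats as forensically meaningful.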

Forensic Correlator LIVE

Cross-references signals from GDNA, Sonar, TPA, compliance maps, and outreach responses to identify patterns. Convergent signals — structural ancestry and behavioral advantage and documentation absence — produce stronger evidence than any single signal alone.

8. Layer 4 — Behavioral Attribution (CDS)

This is the layer that changes everything. Comparative Domain Sensitivity (CDS) answers the question Layer 3 cannot: was this specific dataset used to train this specific model?

CDS v3 Engine IN PROGRESS

CDS measures a model's behavioral advantage on a dataset compared to a reference pool of models. The key insight: if model M performs disproportionately better on dataset D than reference models do — after controlling for model capacity — this is evidence that M was trained on D.

The method:

BPC computation — Bits-per-character loss on the target dataset, normalized to eliminate tokenizer confounds
Reference pool comparison — Median BPC across a pool of reference models establishes the baseline
Capacity calibration — The same measurement on neutral text removes the effect of model size (a larger model is better at everything, not just the target dataset)
CDS v3 score — CDS_v3 = CDS_raw − CDS_neutral
Bootstrap CI — 1,000-iteration bootstrap produces a 95% confidence interval
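The steps above can be sketched in a few functions. The source gives only CDS_v3 = CDS_raw − CDS_neutral, so several details here are assumptions: CDS_raw is taken as the mean, over target documents, of the reference-pool median BPC minus the model's BPC (positive means the model is better), CDS_neutral is the same quantity on neutral text, and the confidence interval is a percentile bootstrap over documents.

```python
import math
import random
import statistics


def bpc(nll_nats: float, n_chars: int) -> float:
    """Bits per character: total negative log-likelihood (nats) converted to bits,
    divided by character count, so tokenizers of different granularity compare fairly."""
    return nll_nats / (math.log(2) * n_chars)


def gap(model_bpc: float, reference_bpcs: list) -> float:
    """Reference-pool median BPC minus model BPC; positive means the model is better."""
    return statistics.median(reference_bpcs) - model_bpc


def cds_v3(target_docs: list, neutral_docs: list) -> float:
    """Each doc is (model_bpc, [reference_bpcs]). Returns CDS_raw - CDS_neutral."""
    cds_raw = statistics.mean(gap(m, refs) for m, refs in target_docs)
    cds_neutral = statistics.mean(gap(m, refs) for m, refs in neutral_docs)
    return cds_raw - cds_neutral


def bootstrap_ci(target_docs: list, neutral_docs: list,
                 iterations: int = 1000, alpha: float = 0.05, seed: int = 0) -> tuple:
    """Percentile bootstrap: resample documents with replacement, recompute the score."""
    rng = random.Random(seed)
    scores = sorted(
        cds_v3(rng.choices(target_docs, k=len(target_docs)),
               rng.choices(neutral_docs, k=len(neutral_docs)))
        for _ in range(iterations)
    )
    lo_idx = int(iterations * alpha / 2)
    return scores[lo_idx], scores[iterations - lo_idx - 1]
```

Subtracting the neutral-text gap is what removes the capacity effect: a model that is simply larger shows the same advantage on both corpora, and the difference cancels. An interval that excludes zero suggests a dataset-specific advantage under these assumptions.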

A positive CDS v3 score means the model shows dataset-specific advantage beyond what capacity alone explains. The score is accompanied by a signal label: STRONG_POSITIVE, WEAK_POSITIVE, NEUTRAL, WEAK_NEGATIVE, or STRONG_NEGATIVE.

CDS Artifacts

Every CDS computation produces a verifiable artifact: a self-contained evidence object with SHA-256 commitment hash, BPC vector hashes for reproducibility, optional Pedersen commitment for ZK compatibility, and an anchor digest for external timestamping (OpenTimestamps, RFC 3161).

Artifacts are compatible with the Evidence Envelope V1 schema, the ForwardChain anchoring protocol, and CRC-1 capsule packaging.

GDNA + CDS: Converging Evidence

When both GDNA (structural ancestry) and CDS (behavioral advantage) point to the same conclusion, the combined evidence is substantially stronger than either alone:

GDNA	CDS	Combined Verdict	Strength
Positive	Positive	Converging Evidence	Strong
Positive	Negative	Structural Only	Moderate
Negative	Positive	Behavioral Only	Moderate
Negative	Negative	No Evidence	None

NEC#1 Integration

CDS scores feed directly into NEC#1 (data provenance) compliance assessment. A strong CDS signal on an undisclosed dataset increases the NEC#1 severity modifier — the compliance gap becomes not just "they didn't say" but "we have behavioral evidence they should have said."

9. Layer 5 — Settlement

The final layer: evidence becomes compensation. The layer as a whole is planned and not yet operational; only the DPI Bridge, which feeds it from Layer 4, is in progress.

DPI Bridge IN PROGRESS

The Data Provider Impact engine converts CDS detection into provider compensation. When CDS detects behavioral advantage on dataset D, the provider of D receives a coverage boost in the settlement estimation. The feedback loop:

CDS detects advantage on dataset D → Provider D gets coverage_boost → Settlement weight increases → Provider D receives higher payout → CFIC certificate includes CDS artifact hash as evidence

The bridge is conservative by design: only strong CDS signals produce boosts, boosts are capped to prevent gaming, and every boost carries an evidence hash traceable to the original CDS artifact.
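The three constraints just described (strong signals only, capped boosts, traceable evidence hash) could look like this in code. The cap, the conversion factor, and the record layout are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional

MAX_BOOST = 0.25        # hypothetical cap on any single coverage boost
BOOST_PER_POINT = 0.5   # hypothetical conversion from CDS score to boost


@dataclass(frozen=True)
class CdsArtifact:
    dataset_id: str
    score: float           # CDS v3 score
    signal: str            # e.g. "STRONG_POSITIVE", "WEAK_POSITIVE", "NEUTRAL"
    commitment_hash: str   # SHA-256 commitment hash of the artifact


def coverage_boost(artifact: CdsArtifact) -> Optional[dict]:
    """Return a boost record for strong signals only; otherwise None."""
    if artifact.signal != "STRONG_POSITIVE":
        return None
    boost = min(MAX_BOOST, max(0.0, artifact.score) * BOOST_PER_POINT)
    return {
        "dataset_id": artifact.dataset_id,
        "coverage_boost": boost,
        "evidence_hash": artifact.commitment_hash,  # traceable to the CDS artifact
    }
```

Capping the boost limits how much any single detection can move a settlement weight, which is the anti-gaming property the text describes.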

CPCS Seal PLANNED

The Crovia Production Commit Seal: a Merkle-anchored payout record with an immutable audit trail. Every settlement is verifiable, every payout is traceable.

CFIC Certificates PLANNED

Crovia Fair Impact Certificates: cryptographic receipts that prove a data provider received compensation, with embedded references to the evidence (CDS artifacts, TPA proofs, compliance gaps) that justified the payout.

10. Operational Metrics

Current system state as of March 2026:

Component	Status	Volume
Observer	LIVE	3,500+ unique targets, 500 observations/day
TPA Chain	LIVE	6,800+ models, chain height 12,500+
Compliance Reports	LIVE	6,800+ models × 20 NEC# × 11 jurisdictions
Outreach Pipeline	LIVE	180+ issues classified
Org Transparency	LIVE	2,300+ organizations graded A–F
Dataset Risk Index	LIVE	5,900+ models scored across 7 signals
Provenance Graph	LIVE	10,500+ nodes, 25,900+ edges
Model Sonar	LIVE	16+ models × 5 domains
GDNA Agent	LIVE	10 model families, streaming shard analysis
CDS v3 Engine	IN PROGRESS	Batch scanner operational

11. Use Cases

For Regulators

Problem: No infrastructure to audit AI training compliance at scale.

Crovia provides: Immutable observation ledger with temporal proofs. Compliance mapping across 11 jurisdictions. Behavioral forensics (GDNA + Sonar + CDS) that go beyond what companies self-report. Risk index that prioritizes the largest evidence gaps.

For Content Creators & Rights Holders

Problem: No way to know if your data was used, and no mechanism to be compensated.

Crovia provides: CDS can detect behavioral evidence of training on specific datasets. The DPI Bridge converts detection into provider impact weight. Exposure scoring prioritizes high-download models where the impact is largest. Every step is backed by cryptographic evidence suitable for legal proceedings.

For AI Companies

Problem: No verifiable way to prove training data compliance.

Crovia provides: If your model genuinely complies, the evidence shows it. TPA records demonstrate disclosure. Compliance mapping confirms regulatory coverage. CDS analysis that shows NO_DATASET_SIGNAL is exculpatory evidence. Transparency is a competitive advantage when evidence exists to back it.

12. Competitive Analysis

Capability	Crovia	C2PA	Spawning	Fairly Trained
Temporal Absence Proofs	Live	No	No	No
Disclosure Monitoring	Live	No	No	No
Regulatory Gap Analysis	Live	No	No	No
Behavioral Fingerprinting	Live	No	No	No
Weight-Level Forensics	Live	No	No	No
Dataset Attribution	In Progress	No	No	No
Risk Intelligence	Live	No	No	No
Content Provenance	No	Yes	No	No
Opt-out Registry	No	No	Yes	No
Training Certification	No	No	No	Yes
Settlement Engine	Planned	No	No	No

Crovia addresses a different problem space than C2PA (content provenance), Spawning (opt-out), or Fairly Trained (ethical certification). These approaches are complementary. Crovia is the only system that combines observation, cryptographic evidence, behavioral forensics, and dataset attribution in a single stack.

13. Business Model

Tier	Status	Includes
Open	Available	Public registry, observation ledger, TPA data, compliance reports, risk index, org scores, outreach status, public API
Pro	In Development	CDS v3 scanner, GDNA analysis, Model Sonar reports, full NEC# canon (20/20), forensic correlator, CEP capsule builder
Enterprise	Planned	On-premise deployment, custom model family monitoring, DPI integration, settlement engine, dedicated support

14. Roadmap

Phase	Timeline	Status	Milestones
Foundation	Q4 2025 – Q1 2026	Complete	Observer, TPA, NEC# canon, public registry, outreach pipeline, compliance mapping
Evidence & Intelligence	Q1 2026	Complete	Risk Index, Exposure Score, Org Grade, Response Classifier, Gating Detector, CEP capsules, provenance graph
Forensics	Q1 – Q2 2026	Live	GDNA fingerprinting, Model Sonar, Agent system, Forensic Correlator
Attribution	Q2 2026	In Progress	CDS v3 engine, DPI Bridge, GDNA+CDS converging evidence, NEC#1 integration
Settlement	Q3 – Q4 2026	Planned	CPCS seals, CFIC certificates, royalty distribution, enterprise integrations
