# Datasheet: Crovia Continuity Observatory

Following the *Datasheets for Datasets* template (Gebru et al., 2018, [arXiv:1803.09010](https://arxiv.org/abs/1803.09010)).

## Motivation

**For what purpose was the dataset created?**
To create a public, cryptographically signed observation log of how AI vendors disclose, or fail to disclose, material information about their models over time. The dataset documents *silence*: the gaps in vendor disclosures that conventional transparency indices (which rely on vendor self-reports) cannot capture.

**Who created the dataset?**
Crovia Trust, an independent continuity observatory project. Maintained by the founding engineer with no commercial sponsorship and no vendor input.

**What support was needed?**
No institutional support. Built and operated on a single Hetzner CPX42 instance, with the Internet Archive Wayback Machine as a public secondary source. Cost: on the order of tens of EUR per month.

## Composition

**What does the dataset represent?**
Each row is an *observation* of a target (a specific AI model, dataset, organization, or repository) at a specific timestamp. Observations are typed:

- `presence`: public artifact reachable, content fingerprinted
- `absence`: public artifact reachable, but a class of disclosure (e.g., training data, licenses, eval) is empirically missing or unchanged for ≥ N days (`AX.ABS`)
- `change`: content changed compared to the prior observation
- `signature`: administrative event (Merkle anchor, OTS stamp, ledger seal)

**How many instances?**
At deposit time: 111,054 observations · 3,500+ unique targets · spanning approx. 2025-Q4 to present.

**What schema?**
Per row (JSONL):

```json
{
  "observation_id": "",
  "target_id": "org/model",
  "tipo_target": "model" | "dataset" | "organization" | "repository",
  "type": "presence" | "absence" | "change" | "signature",
  "observed_at": "",
  "source": "wayback" | "hf_card_http" | "arxiv" | "robots_txt" | "github_api" | "...",
  "evidence": { "...": "..." },
  "claim": { "...": "..." },
  "signature": { "alg": "Ed25519", "sig": "", "pub": "" },
  "merkle_leaf_hash": "",
  "ledger_offset":
}
```

**Sensitive content?**
None. All targets are public artifacts (HF model pages, GitHub repos, arXiv papers). No personal data about identifiable individuals. Vendor organization names and product names are present; these are public.

**Recommended uses?**

- Studying how AI disclosures change over time
- Calibrating compliance frameworks (EU AI Act Art. 53/55)
- Cross-validating other transparency indices (Stanford FMTI, MIT AI Index)
- Anchoring legal/insurance claims that require third-party-witnessed silence
- Research on “proof-of-silence” cryptographic primitives

**Discouraged uses?**

- Inferring intent. The dataset records what is publicly observable. It does not ascribe motive.
- Naming and shaming individual employees. Crovia observes; it does not judge.

## Collection process

**How was data acquired?**
Three primary observation channels, each scheduled by an independent timer:

1. **Wayback channel**: query the Internet Archive TimeMap for the canonical URL of every target. Each capture becomes one observation. This is the strongest channel because the source is a third party.
2. **Direct channel**: polite HTTP fetch of the canonical page (HF model card, arXiv abstract, robots.txt). Used only when Wayback coverage is sparse.
3. **Co-attestation channel**: Crovia submits each target to Wayback Save Page Now, so that the Internet Archive captures it independently. This produces a *future* secondary witness.

**What target population was sampled?**
The unified target list at deposit time contains ~4,300 entries seeded from: the EU AI Office GPAI systemic-risk list, the Stanford CRFM Foundation Model Index, the Hugging Face top-1000 by downloads, GitHub repositories tagged `topic:llm` or `topic:ai` with more than 500 stars, and frontier-focus models with active legal exposure.
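As an illustration of the Wayback channel described above, where each capture becomes one observation, the following is a minimal sketch and not the project's actual collector. The helper name `capture_to_observation`, the ISO 8601 form of `observed_at`, the `capture_url` key inside `evidence`, and the choice of `presence` as the observation type are all assumptions made for the example. The 14-digit timestamp and the replay URL pattern are the Wayback Machine's standard conventions.

```python
from datetime import datetime, timezone

# Standard Wayback replay URL pattern: /web/<14-digit timestamp>/<original URL>
WAYBACK_REPLAY = "https://web.archive.org/web/{ts}/{url}"


def capture_to_observation(target_id, target_type, canonical_url, wayback_ts):
    """Map one Wayback capture to a partial observation row (hypothetical helper)."""
    # Wayback timestamps are YYYYMMDDhhmmss in UTC.
    observed = datetime.strptime(wayback_ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return {
        "target_id": target_id,
        "tipo_target": target_type,
        "type": "presence",  # assumption: a successful capture counts as presence
        "observed_at": observed.isoformat(),  # assumption: ISO 8601 timestamps
        "source": "wayback",
        "evidence": {  # assumption: evidence carries the replay URL
            "capture_url": WAYBACK_REPLAY.format(ts=wayback_ts, url=canonical_url)
        },
    }


# Example with a made-up capture timestamp for a placeholder target.
obs = capture_to_observation(
    "org/model", "model", "https://huggingface.co/org/model", "20251124120000"
)
print(obs["observed_at"])
print(obs["evidence"]["capture_url"])
```

Signing, Merkle leaf hashing, and ledger offsets are left out because the datasheet does not specify how they are computed.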

**Sampling bias?**
The list is intentionally biased toward (a) high-traffic, (b) policy-relevant, and (c) regulatory-disclosed models. It does not attempt to represent the long tail of community fine-tunes. This is a *load-bearing* sample, not a *representative* one.

**Time period?**
Observations begin 2025-11-24 (genesis of the axiom ledger). New observations are appended continuously.

## Preprocessing / cleaning

- Test-only and demo targets are filtered out before silence indexing.
- Duplicate observations of the same `(target, capture_url)` are deduplicated by hash.
- The full append-only ledger is preserved unmodified; cleaning happens only at the index layer.

## Uses

**Has the dataset been used yet?**

- Live website at for public exploration.
- Per-model dossier pages at .
- Embeddable widgets and SVG badges via .
- Weekly leaderboard publication on GitHub: .

## Distribution

**License**
Data: CC-BY-4.0. Attribution: cite the deposit DOI and link to . Code (collectors, generators): Apache-2.0 / MIT (see individual repos).

**How will it be distributed?**

- Public website (continuous)
- Weekly Zenodo deposit (versioned DOIs)
- GitHub repository (snapshots)

## Maintenance

**Who maintains?**
The Crovia Trust founder. Single maintainer at present. Help is welcome.

**How will errors be communicated?**
Issues at . Because the ledger is append-only, errors are not *deleted*; they are *amended* with a follow-up observation that supersedes them.

---

*Crovia observes; it does not judge.*
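
The deduplication rule in the Preprocessing / cleaning section (one entry per `(target, capture_url)`, applied at the index layer while the ledger stays untouched) could look roughly like the sketch below. The function names, the SHA-256 key, the NUL separator, and the location of `capture_url` under `evidence` are assumptions for illustration; the datasheet does not state which hash or key layout is used.

```python
import hashlib


def dedup_key(target_id, capture_url):
    """Hash of (target, capture_url); SHA-256 is an assumed choice."""
    # A NUL separator keeps ("a", "bc") distinct from ("ab", "c").
    return hashlib.sha256(f"{target_id}\x00{capture_url}".encode("utf-8")).hexdigest()


def build_index(ledger):
    """Return the first-seen row per (target, capture_url) without mutating the ledger."""
    seen = set()
    index = []
    for row in ledger:
        key = dedup_key(row["target_id"], row["evidence"]["capture_url"])
        if key not in seen:
            seen.add(key)
            index.append(row)
    return index


# Made-up ledger rows: the second row repeats the first capture.
ledger = [
    {"target_id": "org/model", "evidence": {"capture_url": "https://web.archive.org/web/20251124120000/https://example.org/a"}},
    {"target_id": "org/model", "evidence": {"capture_url": "https://web.archive.org/web/20251124120000/https://example.org/a"}},
    {"target_id": "org/model", "evidence": {"capture_url": "https://web.archive.org/web/20251201120000/https://example.org/a"}},
]
index = build_index(ledger)
print(len(ledger), len(index))
```

Building the index as a separate list mirrors the stated design: the append-only ledger keeps every row, and only the derived view is cleaned.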