Datasets

This page lists every machine-readable dataset in the Alexanarch corpus: its size, record count, contents, and the UI surfaces that consume it. The canonical machine-readable catalog is /api/index.json; this page is its human-readable companion. Counts in teal are fetched live from the index on page load.

Primary registries

Canonical sources of truth. Editing these is the only way to change what the site shows.

data/registry.json 5.97 MB · 881 deposits
The canonical deposit registry. Each entry: bibliographic metadata, canonical v2 AXN, content hash, full-text path, entities[] (subject/predicate/object/type/evidence_status triples — the graph's source), wiki_article, Phase C references_concepts[] + references_concept_count, legacy AXN aliases, and glyphic_canary.
Protocol: deposit-protocol.json (v1)
data/entity-index.json 4.93 MB · 7,173 concepts
Curated concept layer. Each concept: term, definition, defined_in (founder deposit, present on 7,097 of 7,173), entity_triples[], type taxonomy (specification/extracted/structural/empirical/theoretical/formal/genre/foundational/method), engagement type, and Phase C referenced_in[] + reference_count (present on 2,120 concepts).
Refined from: data/lexical-minting-registry.json
Bridged with: data/semantic-addresses.json (348 of 7,173 concepts also targeted by canonical queries)
Consumed by: /wiki/ /graph/
data/lexical-minting-registry.json 3.52 MB · 12,032 raw terms
Broader pre-curation surface. Every term minted, coined, or formally extracted across the corpus — before noise-filtering and curation into the entity-index. 7,045 terms overlap with entity-index, 4,987 are LMR-only (raw), 128 are entity-index-only.
Consumed by: /lexical/
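The overlap figures quoted for the lexical-minting-registry (LMR) can be cross-checked with plain set arithmetic. The sketch below uses only the three counts stated in that entry; the variable names are illustrative.

```python
# Counts quoted in the lexical-minting-registry entry.
overlap = 7_045            # terms present in both the LMR and entity-index
lmr_only = 4_987           # raw terms found only in the LMR
entity_index_only = 128    # concepts found only in entity-index

lmr_total = overlap + lmr_only                        # size of the LMR
entity_index_total = overlap + entity_index_only      # size of entity-index
union_total = overlap + lmr_only + entity_index_only  # distinct terms across both

print(lmr_total, entity_index_total, union_total)  # prints: 12032 7173 12160
```

The first two totals agree with the 12,032 raw terms and 7,173 concepts listed for the two files.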
data/citation-graph.json 1.59 MB · 4,866 edges
Inter-deposit citation edges. After Phase B extraction (commit ee1a1db) edges include: doi_resolution (4,311 — legacy), deposit_number_reference (346 — #N), ea_id_reference (158 — EA-* sovereign IDs), axn_hex_reference (21), axn_reference (12 — full canonical), plus 11 hand-curated types. 696 citing deposits, 445 cited.
Generated by: scripts/citation_extractor.py
Consumed by: /citations/
data/semantic-addresses.json 1.28 MB · 1,964 addresses
Query addresses — canonical queries posed to the composition layer (AI Overview, Google Search) with their observation status. Conceptually a field-set on lexical entities: 100% of the 348 unique refers_to targets exact-match into entity-index. Each address: canonical_query, is_quoted, refers_to[], type, battery_membership[], sources[], observations[], observation_class.
Class counts: subjunctive 1,748 · observed 111 · verified-non 22 · unrated 83
Reconciled with: data/EA-WG-CAPTURES-01-v8.3.json (observations[] entries are capture references)
Consumed by: no dedicated UI surface yet — planned at /addresses/
data/EA-WG-CAPTURES-01-v8.3.json 241 KB · 176 captures
AI Overview Capture Registry — 176 captured queries with match status (mt): EXACT MATCH / BROAD MATCH / ADOPTION / ZERO RESULT / null. Each entry: section, slug, query, date, source-format, status, description, image refs. Reconciled into semantic-addresses via slug.
Consumed by: deposit #3's wiki entry only; a dedicated surface is planned at /captures/
data/doi-resolution-index.json 1.49 MB · 1,675 mappings
Legacy Zenodo DOI → Alexanarch AXN resolution table. Allows users coming in from old CHA DOI links to find the canonical successor record. Each entry: Zenodo DOI, target AXN, deposit_number, title, status (active / migrated / tombstoned).
Consumed by: no UI surface yet — planned at /resolve/
data/datacite-full-backup.json 9.06 MB
Full DataCite metadata snapshot of all 1,817 CHA-minted DOIs. The empirical foundation for the audit in EA-MPAI-DOI-IMPERMANENCE-01 v2.0 (#868). Methodology replicable via DataCite API at https://api.datacite.org/dois/{doi}.
Paginated copies: page 1 · page 2 · page 3 · page 4
Consumed by: none (reference dataset; no live UI projection)
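A minimal sketch of replicating one lookup against the DataCite endpoint named above, using only the Python standard library. The function names and the timeout value are illustrative choices, not part of the audit's tooling.

```python
import json
import urllib.parse
import urllib.request

DATACITE_API = "https://api.datacite.org/dois/"

def datacite_url(doi: str) -> str:
    """Build the lookup URL for one DOI, percent-encoding unsafe characters."""
    # The slash between DOI prefix and suffix is kept as a path separator.
    return DATACITE_API + urllib.parse.quote(doi.strip(), safe="/")

def fetch_doi_record(doi: str, timeout: float = 30.0) -> dict:
    """Fetch the DataCite record for a DOI and return the parsed JSON body."""
    request = urllib.request.Request(
        datacite_url(doi), headers={"Accept": "application/vnd.api+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling fetch_doi_record with one of the CHA-minted DOIs should return a JSON:API document in which the metadata is expected under data, then attributes. Comparing those fields against the matching record in data/datacite-full-backup.json is one way to re-run the audit for a single DOI.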

Derived surfaces

Regenerated from the primary registries by scripts/regenerate_surfaces.py. Manual edits to these files are overwritten on the next regeneration.

data/browse-index.json 443 KB · 881 entries
Compact deposit list (no full-text bodies). Used by the static no-JS browse page.
data/chunks/registry/ 9 chunks · ~1 MB each
Registry split into ~1 MB chunks for human-loadable browsing. Each chunk covers a contiguous deposit-number range; _index.json catalogs the chunks.
/sitemap.xml 103 KB
XML sitemap. Every static record page enumerated for crawler indexing.
/SHA256SUMS.txt 160 KB · 881 lines
Content-addressable checksums for every deposit. Lets any mirror verify byte-exactness of the corpus.
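A mirror could check its copy against /SHA256SUMS.txt roughly as follows. This sketch assumes the file uses the common sha256sum layout (hex digest, two spaces, relative path); the page does not state the format, so treat the parsing as an assumption.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB blocks so large texts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()

def verify_sums(sums_file: Path, root: Path) -> list[str]:
    """Return the relative paths that are missing or whose hash differs."""
    failures = []
    for line in sums_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Assumed line layout: "<hex digest>  <relative path>".
        expected, _, name = line.partition(" ")
        name = name.lstrip(" *")  # tolerate the " *" binary-mode marker
        target = root / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures
```

An empty result means every listed file is byte-identical to the published corpus.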

Protocols & schemas

Machine-enforced definitions. Hand-editing these produces drift detected by bootstrap_familiarization.py. Use protocol_update.py to amend protocols atomically.

api/index.json central catalog
Single source of truth. Lists all protocols, schemas, registries, derived surfaces, scripts — each with content_sha256, canonical_path, and referenced_by. New instances run bootstrap_familiarization.py to verify nothing has drifted before any work.
api/deposit-protocol.json v1 (alexanarch-deposit-protocol/v1)
Deposit validation rules. Rule families: PV (protocol version) / REQ (required fields) / AXN (identifier) / CONS (consistency) / SUR (surface) / IDX (index integrity). Enforced by scripts/validate_deposit.py and CI.
api/axn-protocol.json v2 (axn/v2)
Alexanarch Identifier protocol. Format: AXN:<HEX>.<FAMILY>.<6 EMOJI> where the six-emoji suffix is derived from the first 6 bytes of SHA-256 of canonical bytes, mapped through 256 curated emoji. v1 (4-emoji) aliases preserved in deposit legacy_axn / axn_history.
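The suffix derivation can be sketched in a few lines. The glyph table below is a placeholder of 256 arbitrary code points, not the curated set (which this page attributes to AXN_GLYPHS in scripts/axn_lib.py), and what counts as a deposit's "canonical bytes" is defined by the protocol, not by this example.

```python
import hashlib

# Placeholder table: 256 distinct code points standing in for the curated
# emoji set. The real table is not reproduced here.
PLACEHOLDER_GLYPHS = [chr(0x1F300 + i) for i in range(256)]

def axn_suffix(canonical_bytes: bytes, glyphs: list[str] = PLACEHOLDER_GLYPHS) -> str:
    """Map the first 6 bytes of SHA-256(canonical_bytes) to 6 glyphs."""
    if len(glyphs) != 256:
        raise ValueError("glyph table must have exactly 256 entries")
    digest = hashlib.sha256(canonical_bytes).digest()
    # Each byte (0-255) indexes one glyph; six bytes give a six-glyph suffix.
    return "".join(glyphs[byte] for byte in digest[:6])
```

Because each byte selects one of 256 glyphs, the suffix carries 48 bits of the content hash, and any change to the canonical bytes is very likely to change it.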
api/enrichment-protocol.json v1 (enrichment/v1)
Citation extraction + concept backlink protocol. Defines via-types for citation edges (axn_reference, axn_hex_reference, deposit_number_reference, ea_id_reference, doi_resolution, etc.) and the Phase C bidirectional concept↔deposit indexing.
api/deposit-schema.json + api/schemas/deposit-entry.schema.json JSON Schema
Submission schema (form-facing) and registry entry schema (storage-facing).

Supporting datasets

Reference data not (yet) consumed by any UI surface. Available for inspection, citation, and downstream tooling.

data/external-source-registry.json 84 KB
Catalog of external reception sources (sites, archives, mirrors) referenced across the corpus.
data/heteronym-doi-sift.json 44 KB
Cross-reference of Dodecad heteronym attribution across the legacy Zenodo DOIs.
data/JOURNAL-MAPPING-PRELIMINARY.json 227 KB
Preliminary mapping of deposits to journal-style groupings.
data/batch-axn-assignment.json 974 KB
Historical record of batch AXN assignments during the CHA → Alexanarch migration.
data/restoration-batch-plan.json 20 KB
Restoration plan ledger.
data/zenodo-link-scan.json 217 KB
Survey of inbound and outbound Zenodo links across the corpus.

Per-deposit text bodies

Every deposit's canonical text is stored at data/texts/AXN-<HEX>-text.md. The registry's hex field maps each <HEX> to its canonical deposit_number, and the SHA-256 of the body anchors each text to its AXN.

data/texts/AXN-*-text.md ~25.5 MB total · 881 files
One Markdown file per deposit. The full text body. Source for citation extraction, concept backlinks, and wiki article generation.

Scripts

Canonical operational scripts. The full catalog with descriptions lives in api/index.json → scripts.

scripts/bootstrap_familiarization.py required-first-read
Verifies every protocol/schema content hash matches what api/index.json claims. New instances run this with --strict at session start. Receipt appended to data/instance-familiarization.log.
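The drift check amounts to re-hashing each catalogued file and comparing the result with the index. The sketch below is not the script itself: the field names content_sha256 and canonical_path come from this page, but the top-level "protocols" list and the function name are assumptions about the index layout.

```python
import hashlib
import json
from pathlib import Path

def find_drift(index_path: Path, root: Path) -> list[str]:
    """Return canonical_path values whose file hash differs from the index."""
    index = json.loads(index_path.read_text(encoding="utf-8"))
    drifted = []
    # Assumed layout: index["protocols"] is a list of objects, each carrying
    # canonical_path and content_sha256.
    for entry in index.get("protocols", []):
        target = root / entry["canonical_path"]
        actual = (
            hashlib.sha256(target.read_bytes()).hexdigest()
            if target.is_file()
            else None  # a missing file also counts as drift
        )
        if actual != entry["content_sha256"]:
            drifted.append(entry["canonical_path"])
    return drifted
```

A strict run would treat any non-empty result as a failure before other work starts.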
scripts/protocol_update.py atomic protocol amendment
The only supported path for modifying a protocol JSON. Recomputes hash, updates index, appends to change_log atomically. Direct hand-editing produces drift.
scripts/axn_lib.py canonical AXN derivation
256-entry AXN_GLYPHS table + cluster catalog. Derives v2 6-emoji suffix from first 6 bytes of SHA-256.
scripts/regenerate_surfaces.py idempotent
Brings every derived surface (browse, browse-index, chunks, sitemap, SHA256SUMS) into agreement with data/registry.json. Run after every registry change.
scripts/validate_deposit.py CI-enforced
Validates the registry against the deposit protocol. Rule families: PV/REQ/AXN/CONS/SUR/IDX. Runs on every commit via .github/workflows/validate-registry.yml.
scripts/citation_extractor.py enrichment
Scans deposit texts for AXN refs, EA-* IDs, #N references, and DOIs. Writes new edges to data/citation-graph.json.
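As a rough illustration of this kind of scan, the sketch below finds #N references, EA-* IDs, and DOIs with regular expressions. The patterns are guesses, not the extractor's actual rules, and AXN matching is omitted because it depends on the glyph table. The via-type labels reuse names from the enrichment protocol; mapping bare DOI mentions to doi_resolution is an assumption.

```python
import re

# Hypothetical patterns; the real extractor's rules may differ.
PATTERNS = {
    "deposit_number_reference": re.compile(r"(?<![\w&])#(\d+)\b"),
    "ea_id_reference": re.compile(r"\bEA-[A-Z0-9]+(?:-[A-Z0-9]+)*\b"),
    "doi_resolution": re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+"),
}

def extract_references(text: str) -> list[tuple[str, str]]:
    """Return (via_type, matched_string) pairs found in a deposit text."""
    found = []
    for via_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # Trim sentence punctuation that the DOI pattern may swallow.
            found.append((via_type, match.group(0).rstrip(".,;)")))
    return found

# The DOI in this sample sentence is made up for the example.
sample = "See #868 and EA-MPAI-DOI-IMPERMANENCE-01 v2.0; legacy DOI 10.5281/zenodo.1234567."
for via_type, value in extract_references(sample):
    print(via_type, value)
```

Each pair would still need to be resolved to a target deposit before it becomes an edge in data/citation-graph.json.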
scripts/concept_backlink.py enrichment
Scans every deposit text for every entity-index concept. Writes referenced_in[] and reference_count onto entity-index concepts and references_concepts[] onto registry deposits.
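The bidirectional indexing can be pictured with a small in-memory sketch. The matching rule here (case-insensitive, whole-phrase) and the function name are assumptions; the script's actual matching rules are not documented on this page.

```python
import re

def build_backlinks(texts: dict[int, str], terms: list[str]):
    """Return (referenced_in, references_concepts) for the given texts and terms.

    texts maps deposit_number to body text; terms are concept terms.
    """
    referenced_in = {term: [] for term in terms}
    references_concepts = {number: [] for number in texts}
    for term in terms:
        # Assumed rule: the term must not be embedded inside a longer word.
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        for number, body in texts.items():
            if pattern.search(body):
                referenced_in[term].append(number)        # concept to deposits
                references_concepts[number].append(term)  # deposit to concepts
    return referenced_in, references_concepts
```

Under this reading, reference_count for a concept would be the length of its referenced_in list, that is, the number of deposits mentioning it rather than the number of occurrences.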
scripts/backfill_axn_compliance.py historical
One-time migration. Backfilled 13 pre-v2 AXNs from 4-emoji to 6-emoji canonical, preserving v1 forms in legacy_axn and axn_history.