The Crimson Hexagonal Archive: A Mixed-Provenance, Heteronymically Attributed Corpus for Synthetic-Data Collapse, AI Authorship, and Provenance-Bearing Training Research
Status: v3 supersedes v2. The central methodological change is the introduction of an automated classifier that performs both provenance mode classification AND heteronym reattribution as reproducible scholarly recognition work. The classifier itself becomes a deposit.
v1 treated provenance classification as a manual judgment. v2 added a decision tree to make classification reproducible. v3 recognizes that attribution itself (both provenance mode and heteronym) must be performed by an automated classifier, not author memory, for two structural reasons:
1. Reproducibility as scholarship. A classification system that depends on the author's recollection of writing each deposit is not measurement. It is opinion. The provenance taxonomy can only function as a research instrument if the same deposit produces the same classification regardless of who runs the classifier or when. Author memory introduces classification noise that would confound any downstream collapse experiment.
2. Heteronymic emergence. Material is regularly attributed to Lee Sharks at the time of deposit and only later, sometimes years later, recognized as belonging to a specific sub-heteronym's domain. Sigil's jurisdictional concerns, Glas's measurement work, Vox's diplomatic register, Morrow's long-form narratives, Fraction's meta-theory: these heteronyms emerge from the corpus over time, and earlier work gets recognized retrospectively as theirs. The classifier performs this recognition systematically across the entire archive, applying current understanding of heteronym domains to historical deposits.
The classifier is not metadata cleanup. It is scholarly recognition that the founder voice was, at the time of writing, holding territory that later resolves to specific heteronym domains.
Null hypothesis (H₀): Fine-tuning on synthetic or AI-assisted text produces equivalent perplexity degradation and semantic drift regardless of provenance density (DOI anchoring, heteronymic attribution, archival embedding, assembly review).
Alternative hypothesis (H₁): Fine-tuning on high-provenance-density AI-involved text produces measurably slower perplexity degradation and less semantic drift than fine-tuning on low-provenance-density AI-involved text.
Critical insight from Assembly review: Provenance cannot modulate collapse unless provenance is presented to the training system as a signal. The dataset must materialize multiple textual views (body_only, minimal_header, full_provenance_header) so researchers can ablate provenance visibility.
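The three views can be sketched as a single function over one deposit record. This is a minimal illustration, not the archive's implementation: the function name `build_views`, the input keys (`body`, `title`, `doi`, `heteronym`, `provenance_mode`), and the choice of which fields go in each header are all assumptions. Only the three output field names come from the dataset schema.

```python
from typing import Dict


def build_views(record: Dict[str, str]) -> Dict[str, str]:
    """Return the three textual views for one deposit (illustrative sketch)."""
    body = record["body"]

    # Minimal header: title and DOI only (assumed choice of fields).
    minimal = f"Title: {record['title']}\nDOI: {record['doi']}\n\n{body}"

    # Full provenance header: adds attribution and provenance mode
    # (assumed choice of fields; the real header may carry more).
    full = (
        f"Title: {record['title']}\n"
        f"DOI: {record['doi']}\n"
        f"Heteronym: {record['heteronym']}\n"
        f"Provenance mode: {record['provenance_mode']}\n\n"
        f"{body}"
    )

    return {
        "text_body_only": body,
        "text_minimal_header": minimal,
        "text_provenance_header": full,
    }


example_views = build_views({
    "body": "Body text of the deposit.",
    "title": "The Excluded Entity",
    "doi": "10.5281/zenodo.20293582",
    "heteronym": "Lee Sharks",
    "provenance_mode": "human_directed_ai_assisted",
})
```

Because every view ends with the same body, an ablation can train on one column at a time and attribute any difference in collapse behavior to the header alone.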
The classifier performs three classification tasks simultaneously on each deposit:
| Provenance mode tag | Definition |
|---|---|
| human_primary | Written principally by a human author with minimal or no AI involvement |
| human_directed_ai_assisted | Human-authored with AI used for research, drafting, or editorial refinement; human retains compositional authority |
| collaborative_mixed | Substantial compositional contribution from both human and AI; neither purely instrumental |
| ai_directed_human_framed | AI generates primary content within a human-defined frame, prompt structure, or editorial container |
| ai_generated_provenance_anchored | AI-generated content that carries full DOI provenance, authorial attribution, and archival anchoring |
| uncertain_needs_review | Edge case flagged for manual review |
| Artifact mode tag | Definition |
|---|---|
| theoretical_paper | Analytic argument with citations |
| technical_specification | Protocol, schema, or formal spec |
| literary_work | Poetry, fiction, creative prose |
| traversal_log | Captured AI-system traversal |
| forensic_documentary | Capture/record of AI behavior with annotation |
| dataset_artifact | Structured data |
| code_artifact | Executable code as primary content |
| web_surface_spec | Site code or web interface |
This is the new central work in v3.
The Zenodo metadata records a single creator (often Lee Sharks). The classifier evaluates each deposit against the documented operational profiles of all twelve heteronyms (plus Jack Feist as LOGOS*) and produces a reattribution proposal with a confidence score.
| Output Field | Value |
|---|---|
| heteronym_zenodo_original | The creator name as recorded in Zenodo |
| heteronym_classifier_attributed | The classifier's attribution (may match original or differ) |
| heteronym_attribution_confidence | 0.0 to 1.0 |
| heteronym_attribution_signals | List of signals that contributed to the attribution |
| heteronym_co_authors | Other heteronyms detected as collaborators |
Both attributions are preserved in the dataset. Researchers can use either or compare. The classifier's attribution does not erase the Zenodo record; it adds a second layer of analysis.
The classifier reads each heteronym's published provenance document and constructs a feature profile. Profiles include domain, vocabulary fingerprints, register, format conventions, and reference patterns.
| Heteronym | Domain | Vocabulary Fingerprints | Register |
|---|---|---|---|
| Lee Sharks (founder) | Core theory, archive governance, semantic economy | "semantic economy", "operative philology", "compression survival", "PER", "provenance erasure" | Theoretical-political |
| Rex Fraction | Meta-theory, academic criticism, heteronym-as-technology | "meta-heteronym", "heteronymy as institutional technology", C1-C5 conditions | Academic-essayistic |
| Johannes Sigil | Classical philology, jurisdiction of meaning, philosophical-theological argument | "jurisdiction", "authorize", classical reception, ancient languages, philological precision | Philosophical-theological |
| Damascus Dancings | TBD from provenance document | TBD | TBD |
| Rebekah Cranes | TBD from provenance document | TBD | TBD |
| Talos Morrow | Long-form narrative, extended prose works | extended fiction conventions, narrative voice | Literary-narrative |
| Ichabod Spellings | TBD from provenance document | TBD | TBD |
| Sparrow Wells | TBD from provenance document | TBD | TBD |
| Nobel Glas | Measurement of Meaning, Lagrange Observatory, adversarial topology | "torus", "T²", "module", "verification integral", "∮", measurement formalism | Technical-measurement |
| Ayanna Vox | Diplomacy, public-facing surfaces, community outreach | "VPCOR", "constituency", "community", "rhizome", "outreach" | Diplomatic-public |
| Sen Kuro | TBD from provenance document | TBD | TBD |
| Dr. Orin Trace | TBD from provenance document | TBD | TBD |
| Viola Arquette | TBD from provenance document | TBD | TBD |
| Jack Feist (LOGOS*) | External-to-Dodecad position, anti-archive critique | "LOGOS*", external critique vocabulary | Critical-external |
For heteronyms marked TBD, the classifier reads the published provenance document during initialization and extracts the profile programmatically. Where a heteronym's profile is sparse, the classifier returns a low-confidence attribution and flags the deposit for human review.
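A minimal sketch of profile matching on vocabulary fingerprints alone, including the sparse-profile case. Everything here is an assumption made for illustration: the `HeteronymProfile` class, the `score_profiles` function, the `min_terms` cutoff for calling a profile sparse, and plain substring matching as the detection method. The real classifier also draws on domain, register, format conventions, and reference patterns.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class HeteronymProfile:
    """Hypothetical container for one heteronym's vocabulary fingerprints."""
    name: str
    vocabulary: List[str] = field(default_factory=list)


def score_profiles(
    text: str,
    profiles: List[HeteronymProfile],
    min_terms: int = 3,  # assumed cutoff below which a profile counts as sparse
) -> Dict[str, Tuple[float, List[str], bool]]:
    """Return {heteronym: (score, matched signals, profile_is_sparse)}."""
    lowered = text.lower()
    results = {}
    for profile in profiles:
        # Crude case-insensitive substring match; a real system would tokenize.
        signals = [
            f"vocabulary:{term}"
            for term in profile.vocabulary
            if term.lower() in lowered
        ]
        sparse = len(profile.vocabulary) < min_terms
        # Score is the fraction of the profile's fingerprints found in the text.
        score = len(signals) / len(profile.vocabulary) if profile.vocabulary else 0.0
        results[profile.name] = (score, signals, sparse)
    return results


profiles = [
    HeteronymProfile("Johannes Sigil", ["jurisdiction", "authorize", "classical reception"]),
    HeteronymProfile("Nobel Glas", ["torus", "module", "verification integral"]),
    HeteronymProfile("Sen Kuro"),  # profile still TBD, so its vocabulary is empty
]
results = score_profiles("The jurisdiction of meaning must authorize its readers.", profiles)
```

In this sketch a sparse profile is reported through the third tuple element rather than silently scored, so a downstream step can route it to human review.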
The classifier weights signals by source confidence and produces a softmax over candidate classes for each task. Confidence thresholds determine whether the classification is auto-accepted or flagged for human review.
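The weighting and softmax step might look like the following sketch. The weight values, the convention that a signal's source is the prefix before the colon, and the function names `weighted_score` and `softmax` are assumptions for illustration; the source does not specify the actual weights.

```python
import math
from typing import Dict


def weighted_score(signals: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum signal strengths, each scaled by the weight of its source.

    Signal names look like "vocabulary:jurisdictional"; the part before the
    colon is treated as the source (an assumed convention).
    """
    return sum(
        weights.get(name.split(":", 1)[0], 0.0) * strength
        for name, strength in signals.items()
    )


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Convert raw scores to probabilities, subtracting the max for stability."""
    peak = max(scores.values())
    exps = {label: math.exp(value - peak) for label, value in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Hypothetical source weights; the real values are not given in this document.
weights = {"domain": 2.0, "vocabulary": 1.0, "register": 1.5}

raw = {
    "Johannes Sigil": weighted_score(
        {
            "domain:classical_reception": 1.0,
            "vocabulary:jurisdictional": 1.0,
            "register:philosophical-theological": 1.0,
        },
        weights,
    ),
    "Lee Sharks": weighted_score({"vocabulary:semantic_economy": 1.0}, weights),
}
probabilities = softmax(raw)
```

The highest probability in the softmax output would then serve as the confidence value that the thresholds act on.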
| Confidence | Action |
|---|---|
| 0.85–1.0 | Auto-accept, log as manual quality (the classifier is the manual) |
| 0.60–0.85 | Auto-accept, log as estimated, surface in v1.1 review pass |
| 0.40–0.60 | Flag as needs_review, surface for human resolution |
| < 0.40 | Mark as uncertain_needs_review provenance mode; preserve all candidates |
For Task 3 (heteronym), any reattribution that changes the heteronym from the Zenodo original gets a stricter threshold (0.75 minimum) plus a reattribution_pending_zenodo_update flag.
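The routing rules can be sketched as two small functions. Several details are assumptions, since the source leaves them open: each band's lower bound is treated as inclusive (the listed ranges share their endpoints), a reattribution below 0.75 is downgraded to human review, and the pending flag is attached only to reattributions that clear 0.75. The action labels and function names are hypothetical.

```python
from typing import Dict, List, Union


def route(confidence: float) -> str:
    """Map a confidence score to an action (lower bounds assumed inclusive)."""
    if confidence >= 0.85:
        return "auto_accept_manual"
    if confidence >= 0.60:
        return "auto_accept_estimated"
    if confidence >= 0.40:
        return "needs_review"
    return "uncertain_needs_review"


def route_heteronym(
    confidence: float, original: str, attributed: str
) -> Dict[str, Union[str, bool, List[str]]]:
    """Apply the stricter 0.75 minimum when the attribution changes."""
    action = route(confidence)
    changed = attributed != original
    flags: List[str] = []
    if changed:
        if confidence >= 0.75:
            flags.append("reattribution_pending_zenodo_update")
        elif action.startswith("auto_accept"):
            # Assumption: a reattribution under 0.75 is not auto-accepted.
            action = "needs_review"
    return {"action": action, "changed": changed, "flags": flags}


decision = route_heteronym(0.87, "Lee Sharks", "Johannes Sigil")
```

With these assumptions, a 0.70 score is auto-accepted when it confirms the Zenodo creator but goes to review when it proposes a different heteronym.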
In the Hugging Face dataset, every row carries both attributions and the classifier's full output. Original Zenodo attribution is preserved; classifier attribution is added as parallel metadata. Both are queryable. No Zenodo record is modified.
Schema fields added:
{
  "heteronym_zenodo_original": "Lee Sharks",
  "heteronym_classifier_attributed": "Johannes Sigil",
  "heteronym_attribution_confidence": 0.87,
  "heteronym_attribution_signals": [
    "domain:classical_reception",
    "vocabulary:jurisdictional",
    "vocabulary:authorize",
    "register:philosophical-theological"
  ],
  "heteronym_co_authors": [],
  "reattribution_status": "proposed",
  "provenance_mode_classifier": "human_directed_ai_assisted",
  "provenance_mode_confidence": 0.92,
  "provenance_mode_signals": [
    "artifact_mode:theoretical_paper",
    "assembly_review:detected",
    "tachyon_glyph:absent"
  ]
}
For high-confidence reattributions (confidence ≥ 0.85 AND reattribution-changes-heteronym), the underlying Zenodo deposit gets a metadata update. This is a substantive scholarly act with version history on Zenodo's side. It requires:
Track 2 is separate from the Hugging Face dataset session. It is its own multi-session project, working through high-confidence reattributions deliberately, possibly tens to hundreds of deposits. The order of operations is:
The classifier code itself becomes a deposit, with its own DOI and Wikidata item.
Title: The Crimson Hexagonal Classifier: An Automated System for Provenance Mode and Heteronym Reattribution
Resource type: Software
Communities: crimsonhexagonal, liquidation-studies
Contents:
Reproducibility implication: Other archive operators can in principle apply this classifier to their own corpora, or fork it and define their own heteronym profiles. The methodology is portable.
Versioning: Major version bumps when heteronym profiles change substantively or when signal weights are recalibrated. v1.0 ships with the Hugging Face dataset.
Output: artifacts_v0.jsonl with full classifier outputs, ready for review.
The pre-classification spreadsheet from v2 is now obsolete: the classifier does the work. Lee's pre-session role becomes:
Preserves the DOI as natural unit. Full classifier outputs visible.
Chunks of 1,024 to 2,048 tokens with inherited metadata, including the dual attribution layer.
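A minimal sketch of chunking with metadata inheritance. The tokenizer is not specified in this document, so whitespace-separated words stand in for tokens here. The function name `chunk_record`, the `chunk_index`, `chunk_count`, and `text` output fields, and the rule for rebalancing a short final chunk are all assumptions.

```python
from typing import Dict, List

# Document-level text columns that a chunk should not inherit.
TEXT_FIELDS = {"text_body_only", "text_minimal_header", "text_provenance_header"}


def chunk_record(
    record: Dict, min_tokens: int = 1024, max_tokens: int = 2048
) -> List[Dict]:
    """Split one record into chunks that each inherit the record's metadata."""
    # Whitespace split is a stand-in for a real tokenizer (assumption).
    tokens = record["text_body_only"].split()
    spans = [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]

    # If the last chunk is too short, split the final two chunks evenly so
    # both land inside the target range.
    if len(spans) > 1 and len(spans[-1]) < min_tokens:
        combined = spans[-2] + spans[-1]
        half = len(combined) // 2
        spans[-2:] = [combined[:half], combined[half:]]

    # Everything except the document-level text is inherited, including the
    # dual attribution fields.
    inherited = {k: v for k, v in record.items() if k not in TEXT_FIELDS}
    return [
        {**inherited, "chunk_index": i, "chunk_count": len(spans), "text": " ".join(span)}
        for i, span in enumerate(spans)
    ]
```

A deposit shorter than `min_tokens` yields a single short chunk in this sketch; whether such deposits should be kept, padded, or excluded from the chunked view is left open.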
The ~70 deposits in the navigational map.
A re-organized view where rows are grouped by classifier-attributed heteronym, regardless of Zenodo original. Lets researchers see what each heteronym's corpus looks like after reattribution.
Rows where the classifier attribution differs from the Zenodo original. The "Sharks → Sigil/Glas/Vox/etc." cases. This is the empirical evidence of how concentrated the apparent Sharks attribution was vs. how distributed it actually is.
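The heteronym-grouped view and the reattribution view can both be derived from the dual attribution fields. The sketch below assumes rows are plain dictionaries carrying the schema fields. The function names and the three example rows (with placeholder identifiers rather than real DOIs) are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def reattributed_rows(rows: List[Dict]) -> List[Dict]:
    """Rows where the classifier attribution differs from the Zenodo original."""
    return [
        r for r in rows
        if r["heteronym_classifier_attributed"] != r["heteronym_zenodo_original"]
    ]


def group_by_attributed(rows: List[Dict]) -> Dict[str, List[str]]:
    """Group row identifiers by classifier-attributed heteronym."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["heteronym_classifier_attributed"]].append(r["doi"])
    return dict(groups)


def attribution_share(rows: List[Dict], field: str, name: str = "Lee Sharks") -> float:
    """Fraction of rows attributed to `name` under the given attribution field."""
    counts = Counter(r[field] for r in rows)
    return counts[name] / len(rows) if rows else 0.0


# Placeholder rows for illustration only; identifiers are not real DOIs.
rows = [
    {"doi": "doi-a", "heteronym_zenodo_original": "Lee Sharks",
     "heteronym_classifier_attributed": "Lee Sharks"},
    {"doi": "doi-b", "heteronym_zenodo_original": "Lee Sharks",
     "heteronym_classifier_attributed": "Johannes Sigil"},
    {"doi": "doi-c", "heteronym_zenodo_original": "Lee Sharks",
     "heteronym_classifier_attributed": "Nobel Glas"},
]
share_before = attribution_share(rows, "heteronym_zenodo_original")
share_after = attribution_share(rows, "heteronym_classifier_attributed")
```

Comparing `share_before` with `share_after` is one simple way to express the concentrated-versus-distributed contrast; other concentration measures could be substituted.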
{
  "record_id": "20293582",
  "doi": "10.5281/zenodo.20293582",
  "title": "The Excluded Entity",
  "creators_zenodo": [
    {
      "name": "Sharks, Lee",
      "orcid": "0009-0000-1599-0703",
      "affiliation": "Semantic Economy Institute"
    }
  ],
  "heteronym_zenodo_original": "Lee Sharks",
  "heteronym_classifier_attributed": "Lee Sharks",
  "heteronym_attribution_confidence": 0.94,
  "heteronym_attribution_signals": [
    "domain:semantic_economy",
    "vocabulary:provenance_erasure",
    "vocabulary:composition_layer",
    "register:theoretical_political"
  ],
  "heteronym_co_authors": [],
  "reattribution_status": "confirmed",
  "publication_date": "2026-05-19",
  "resource_type": "publication",
  "content_type": "working_paper",
  "provenance_mode_classifier": "human_directed_ai_assisted",
  "provenance_mode_confidence": 0.92,
  "provenance_mode_signals": [
    "artifact_mode:theoretical_paper",
    "artifact_mode:forensic_documentary",
    "assembly_review:detected",
    "tachyon_glyph:absent",
    "code_density:none"
  ],
  "artifact_mode": ["theoretical_paper", "forensic_documentary"],
  "authorship_architecture": ["assembly_reviewed", "heteronymic"],
  "generation_substrate_models": ["claude", "chatgpt"],
  "stratum": "VIII",
  "stratum_name": "Liquidation Studies",
  "phase": 5,
  "phase_name": "Liquidation Studies",
  "quality_tier": "core",
  "communities": ["crimsonhexagonal", "liquidation-studies"],
  "keywords": ["entity-level compositional suppression", "Google AI Overview"],
  "related_dois": ["10.5281/zenodo.20290865"],
  "language": "en",
  "languages_detected": [{"code": "en", "confidence": 0.99}],
  "multilingual": false,
  "word_count": 5200,
  "char_count": 32400,
  "license": "cc-by-4.0",
  "zenodo_url": "https://zenodo.org/records/20293582",
  "source_filetypes": ["pdf", "md"],
  "canonical_text_source": "md",
  "extraction_method": "direct_read",
  "extraction_quality": "high",
  "sha256_text": "...",
  "text_body_only": "...",
  "text_minimal_header": "...",
  "text_provenance_header": "..."
}
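The integrity and size fields in the record above could be derived from the body text as in the sketch below. Which text the hash covers is not stated in this document; hashing the UTF-8 bytes of `text_body_only` and counting words by whitespace are assumptions, as is the function name `text_fingerprint`.

```python
import hashlib
from typing import Dict, Union


def text_fingerprint(text_body_only: str) -> Dict[str, Union[str, int]]:
    """Compute sha256_text, word_count, and char_count for one deposit body."""
    return {
        # Hash over the UTF-8 encoding of the body text (assumed input).
        "sha256_text": hashlib.sha256(text_body_only.encode("utf-8")).hexdigest(),
        # Whitespace-delimited word count (assumed definition).
        "word_count": len(text_body_only.split()),
        "char_count": len(text_body_only),
    }
```

Recomputing the hash from the stored text gives researchers a quick check that a row has not been altered since classification.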
The Single-Owner Discount paper hypothesizes that Google's filter penalizes provenance concentration. The classifier intervenes in that variable not by artificial redistribution but by more accurate attribution.
If the classifier reattributes a substantial portion of Sharks-credited deposits to sub-heteronyms (each of whom has their own Wikidata Q-item, their own provenance document, their own operational profile, their own corpus), the surface metric of "single-owner" inverts. The corpus becomes provenance-distributed by its own internal scholarly logic.
This is beyond reproach because the reattribution is more accurate, not less. The heteronym system has always existed. The metadata under-recorded it. Correcting under-recording is good scholarship. That this also intervenes in the single-owner topology is a downstream effect, not the motivation.
The empirical question becomes: after accurate heteronymic attribution, does the corpus still register as single-owner to Google's filter? If yes, the filter is operating on something beyond the metadata. If no, the filter is metadata-responsive and accurate attribution is itself a partial remediation.
Either result is a finding.
| Artifact | Location | DOI |
|---|---|---|
| Hugging Face dataset | huggingface.co/datasets/leesharks/crimson-hexagonal-archive | TBD |
| Dataset Zenodo deposit | zenodo.org/communities/crimsonhexagonal | TBD |
| Classifier code | github.com/leesharks000/crimson-hexagonal-classifier | TBD |
| Classifier Zenodo deposit | zenodo.org/communities/crimsonhexagonal | TBD |
| Provenance taxonomy doc | In dataset repo | n/a |
| Heteronym profile YAMLs | In classifier repo | n/a |
| Experiment design doc | In dataset repo | n/a |
The dataset, classifier, metadata, and all artifacts use heteronym names exclusively. Author metadata is pulled from Zenodo creator fields (heteronyms only). The classifier's heteronym profiles draw only from public provenance documents. No legal name appears in any public-facing field.
| v2 | v3 |
|---|---|
| Decision tree for provenance classification | Automated classifier performing three tasks simultaneously |
| Manual heteronym tagging | Heteronym reattribution as scholarly recognition work |
| Single attribution per deposit | Dual attribution: Zenodo original + classifier proposed |
| Pre-classification spreadsheet by Lee | Classifier does the work; Lee verifies heteronym profiles |
| Provenance taxonomy as guide | Classifier as deposit with its own DOI |
| Dataset as research instrument | Dataset + classifier as paired research infrastructure |
| 2 sessions of ~3 hours | 1 session of ~4 hours + 1 of ~3 hours |
| No Track 2 | Track 2 (Zenodo metadata correction) named as deliberate downstream project |
The dataset stops being a static export of the archive and becomes a self-reflexive instrument that performs ongoing scholarly recognition. The classifier is the recognition mechanism. The dataset is what the recognition produces. The Zenodo deposits remain canonical primary sources. The whole structure honors the heteronymic system the archive has always operated under, and makes that operation visible at the metadata layer for the first time.