AXN:0273.GOVERNANCE
Celestial · Gestural · Liminal · Liminal · Temporal · Mathematical
Origin → Touch → Threshold → Threshold → Duration → Proof

Provenance After AI Metadata Packet for Disambiguation: From Artifact Authenticity to Licensing Audit to Semantic Proven

Lee Sharks · 2026-05-08 · Dataset
Substrate: Various
License: CC-BY-4.0
SHA-256: 0db0efed6f93573d772669081d1e7d34296c930a6fb05862068cd6f8c73c35c9

Description

Secondary Entity: Semantic Provenance / Provenance Erasure Rate (PER)

Full Text

Provenance After AI

Metadata Packet for Disambiguation: From Artifact Authenticity to Licensing Audit to Semantic Provenance

Packet ID: EA-MPAI-PROVENANCE-01

Version: v1.1 (Assembly Pass)

Type: Bridge Packet (disciplinary clarification)

Primary Entity: Provenance

Secondary Entity: Semantic Provenance / Provenance Erasure Rate (PER)

Relation: Extension and completion, not substitution or critique

Canonical Claim: Existing provenance frameworks address the artifact (C2PA / Content Credentials) and the corpus (Data Provenance Initiative, EU AI Act transparency provisions, W3C PROV). They are not designed to address the survival of authorial lineage through AI synthesis. Semantic provenance names this dimension and proposes Provenance Erasure Rate (PER) as a framework metric for measuring it.

Governing Doctrine: The aim is not to own "provenance." The aim is to extend the existing frameworks by naming the dimension they were not designed to address.


0. Executive Symbolon

The provenance discourse of 2025-2026 has substantially advanced two dimensions of the problem and has begun, but not yet completed, the third.

The first dimension, artifact authenticity, has a maturing technical infrastructure. The Coalition for Content Provenance and Authenticity (C2PA) v2.0 specification (ratified 2024; v2.1 published May 2025) provides cryptographic Content Credentials. Major platforms, device makers, media organizations, and AI companies have begun adopting C2PA / Content Credentials for content-origin and edit-history signaling. Adoption is uneven; user-facing verification interfaces are nascent; the social infrastructure of trust is still being built. The technical question (was this content created at this moment by this source?) has a developing answer.

The second dimension, training-corpus licensing, has academic instrumentation and emerging legal architecture. The Data Provenance Initiative (Longpre et al., Nature Machine Intelligence 2024) audited 1,800+ datasets, finding that 85% of licenses request attribution and 30% include share-alike clauses, with license omission rates above 70% and error rates above 50% on popular hosting sites. EU AI Act Article 50 establishes transparency obligations for AI-generated or AI-altered content (with implementation guidance and timelines subject to ongoing 2026 regulatory development); the Act's broader provisions (Recitals 105-106 on training-data transparency, Article 53 on copyright opt-out signaling, the AI liability discussions) constitute a more comprehensive licensing-provenance regime than disclosure alone. The legal-political question (under what permissions did this corpus enter this system?) has a developing answer.

The third dimension is the one the existing frameworks were not designed to address: what happens when AI synthesis collapses authorial lineage into ungrounded fluency?

When an AI summary reproduces an argument without citing the scholar who developed it, the artifact may be authenticated (the summary was really generated by that model) and the corpus may be licensed (the model was trained on legally permitted text), but the meaning has lost its lineage. The scholar's labor has been absorbed into model capacity without acknowledgment. The reader receives the argument as if it arrived from nowhere.

Existing frameworks are not designed to detect this. C2PA's v2.1 ingredient assertions (which can record that an output was derived from specific inputs) are an early step in this direction, but they are optional, under-adopted, and operate at the level of file derivation, not concept lineage, intellectual debt, or framework membership. The Data Provenance Initiative audits whether datasets were licensed, not whether synthesized outputs preserve attribution to the human sources whose labor the synthesis depended upon. EU AI Act Article 50 mandates disclosure that content is AI-generated, not preservation of the lineage of meaning the content carries.

Semantic provenance names the dimension that completes the C2PA ambition of trust in digital content by extending provenance from the moment of creation to the lifecycle of the meaning the content carries. It is offered as a constructive extension of existing frameworks, not a critique of their adequacy in their own domains.

Aphoristic Tooth

Provenance is where we come from. Strip it, and meaning becomes extractive liquidity.

Central Invariant

The provenance problem has three dimensions, not three sequential stages. They operate simultaneously and independently:

| Dimension | Object | Question | Existing framework |
| --- | --- | --- | --- |
| Artifact | The file | Was this content created at this moment by this source? | C2PA, Content Credentials |
| Licensing | The corpus | Under what permissions did this data enter this system? | Data Provenance Initiative, EU AI Act, W3C PROV |
| Semantic | The meaning | Does the synthesized output preserve accountability to its sources? | (gap; PER proposed as instrument) |

These three dimensions reflect the current institutional landscape, not an ontological claim that provenance is exhausted by three categories. Other dimensions (behavioral, cultural, community-governance) may emerge as synthetic media evolves. The dimensions are independent: a C2PA-signed image can carry meaning whose lineage has been eroded; a licensed dataset can produce outputs with excellent semantic provenance if the synthesis preserves attribution; an unlicensed dataset can produce outputs that nonetheless cite their sources clearly. Solving any one dimension does not solve the others.

The semantic dimension is the least operationalized in current frameworks. The packet proposes Provenance Erasure Rate (PER) as a framework metric, sketches an operational definition, and identifies the adjacent measurement traditions (RAG faithfulness, citation precision/recall, data attribution, plagiarism detection) that PER is related to but distinct from.


1. Entity Definitions

1.1 Provenance: the established discourse

Provenance is an old concept with multiple disciplinary homes:

[ archival science (records management, custody chains, contextual provenance, respect des fonds) | art history (chain of ownership, attribution) | legal evidence (chain of custody) | supply-chain management (origin tracking) | data provenance (W3C PROV, lineage tracking) | content authenticity (C2PA, cryptographic signing) | dataset documentation (DPI, model cards, dataset cards) | digital preservation (OAIS, PREMIS, including transformations and derivations) ]

Each tradition answers a specific question about origin. Each has its own technical apparatus, governance regime, and institutional embedding. The contemporary AI-era provenance discourse sits at the intersection of the last four.

Archival precedent acknowledged. Archival theory has long insisted that provenance is contextual and meaning-bearing: respect des fonds requires understanding the record's context of creation, custodial history, and function. Digital preservation standards (OAIS, PREMIS) include transformations and derivations. What AI synthesis introduces is not the discovery that provenance has a meaning dimension. What it introduces is the first adversary capable of stripping that meaning dimension at machine scale, without human mediation, across billions of documents, in operational pipelines that no human can audit. Semantic provenance is the name proposed for what archival science must now defend against an operation it was not designed to encounter.

1.2 Semantic Provenance: the extension

Semantic provenance names the dimension the existing AI-era frameworks were not built to address: the lineage of meaning that survives or fails to survive AI synthesis. It is constituted by:

[ authorial attribution | source citation | conceptual ancestry | tradition of inheritance | intellectual debt | community of practice | the labor that produced the meaning | the institutions that preserved it | the readers who carried it forward ]

Semantic provenance is part of the value-form of meaning (value-form: what gives something its social capacity to be recognized, credited, built upon, and compensated). To strip provenance is not merely to remove a tag; it is to convert meaning from accountable knowledge into extractive liquidity (extractive liquidity: meaning that circulates without accountability to its origin, enriching the platform/model deployer while depriving the source of citation, reputation, and downstream value).

A concrete micro-economic example: A scholar's framework is absorbed into a model's parametric memory. The model's deployer charges $20/month for access to outputs that reproduce the framework. The scholar receives $0. The framework circulates as "common knowledge." The extraction is structural rather than malicious (no individual decision was made to deprive the scholar), but the value-form of the meaning has been altered: it has become liquid, separable from its source, available for monetization without the source's participation.

Distinction from in-principle archival semantic provenance. All provenance has always been semantic in principle. The AI era operationalizes the semantic dimension as a separate technical and governance problem. Before AI synthesis at scale, semantic provenance was preserved by default because human intermediaries (editors, librarians, teachers, peer reviewers, readers) maintained lineage as part of the labor of transmission. AI synthesis displaces these intermediaries, making semantic-provenance loss a systemic rather than exceptional outcome. The concept needs its own name now because the infrastructure has changed.

Citation is not identical to semantic provenance. A citation may point to a source while failing to preserve the concept's authorial lineage, framework membership, quotation boundary, interpretive context, or derivative-use status. An AI summary that says "according to Smith (2023)" while paraphrasing in a way that detaches the concept from Smith's broader framework has cited but not preserved provenance.

Cultural specificity acknowledged. The concepts of ancestral provenance and futural provenance introduced below have deep roots in Indigenous knowledge systems, where lineage is not merely informational but relational, spiritual, and legal. The Māori concept of whakapapa, the Haudenosaunee Kayanere'kó:wa, and Aboriginal Australian Songlines all encode ancestral provenance as living obligation. Indigenous data sovereignty frameworks (CARE Principles: Collective benefit, Authority to control, Responsibility, Ethics) extend these traditions into contemporary data governance. Semantic provenance does not invent ancestral lineage; it extends pre-existing traditions into the AI era and recognizes that the same structures of erasure that have historically dispossessed Indigenous knowledge are now being industrialized at planetary scale. This packet is meant to support, not appropriate, those traditions.

1.3 Provenance Erasure Rate (PER): provisional framework metric

PER is offered as a framework metric for the semantic dimension, awaiting empirical validation through pilot studies and inter-rater reliability work. Provisional formula:

PER = 1 − (retained provenance units / required provenance units)

For a given AI-generated output (summary, answer, synthesis), provenance units present in the source(s) are identified; required units are derived from those present in the input; retained units are those preserved in the output. PER is one minus the ratio of retained to required units, so it ranges from 0 (full preservation) to 1 (complete erasure).
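A minimal sketch of the provisional formula in Python. The function name, the argument names, and the validation rules are assumptions of this sketch; the packet itself specifies only the ratio.

```python
def provenance_erasure_rate(retained: float, required: int) -> float:
    """Return PER = 1 - retained / required, a value in [0, 1]."""
    if required <= 0:
        raise ValueError("at least one required provenance unit is needed")
    if not 0 <= retained <= required:
        raise ValueError("retained units must lie between 0 and required")
    return 1 - retained / required


# Hypothetical output that keeps 3 of 4 required units (for example, drops the date).
print(provenance_erasure_rate(retained=3, required=4))  # 0.25
```

Allowing `retained` to be a float leaves room for fractional coding of partially preserved units.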

Provenance-unit hierarchy (PER scored at three depths):

| Tier | Units | PER variant |
| --- | --- | --- |
| Minimal | author/source, title or URL/DOI, date, claim boundary | PER-M |
| Conceptual | originating framework, intellectual tradition, community of practice, derivative-use status | PER-C |
| Deep | context lineage, ancestral genealogy, social/location history, futural obligation | PER-D |

Different use cases require different depths. A news-summary application may target PER-M. A scholarly synthesis tool requires PER-C. A cultural-heritage preservation system requires PER-D.

Worked example (stylized):

Source claim: Scholar X argues Y in Work Z, published year N, as part of framework F, with quotation boundaries marked.

AI synthesis: "Some researchers argue Y."

Required provenance units (PER-C): author, work, date, framework membership, claim boundary, derivative-use status. (6 units.)

Retained units: "some researchers" (vague gesture toward source category; counts as fractional, generously coded as 0.5).

PER-C ≈ 1 − (0.5 / 6) ≈ 0.92.
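The stylized example can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the names `MINIMAL`, `CONCEPTUAL`, and `per_score` are hypothetical; treating the conceptual tier as including the minimal tier is inferred from the example's six-unit list; and "work" and "framework membership" are mapped onto the hierarchy's "title or URL/DOI" and "originating framework" labels.

```python
# Candidate units per tier, copied from the hierarchy. Treating each tier as
# including the tier below it is an assumption of this sketch.
MINIMAL = ["author/source", "title or URL/DOI", "date", "claim boundary"]
CONCEPTUAL = MINIMAL + [
    "originating framework",
    "intellectual tradition",
    "community of practice",
    "derivative-use status",
]


def per_score(candidates, present_in_source, retained_credit):
    """PER over the candidate units that the source actually carries.

    retained_credit maps a unit to a coder-assigned credit in [0, 1];
    units missing from the mapping count as fully erased.
    """
    required = [u for u in candidates if u in present_in_source]
    if not required:
        raise ValueError("source carries no units at this tier")
    retained = sum(retained_credit.get(u, 0.0) for u in required)
    return 1 - retained / len(required)


# Stylized source: Scholar X, Work Z, year N, framework F, marked quotations.
source_units = {
    "author/source", "title or URL/DOI", "date",
    "claim boundary", "originating framework", "derivative-use status",
}
# "Some researchers argue Y": a vague gesture at the source, coded 0.5.
credit = {"author/source": 0.5}

print(round(per_score(CONCEPTUAL, source_units, credit), 2))  # 0.92
print(per_score(MINIMAL, source_units, credit))               # 0.875
```

Scoring the same output at the minimal depth gives a lower erasure rate because fewer units are required, which is one reason a score should always be reported with its tier.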

PER is not RAG faithfulness. RAG faithfulness asks whether an answer is supported by retrieved sources. Semantic provenance asks whether the answer preserves the lineage of the meaning it uses. A faithful RAG answer can have high PER if it summarizes accurately while stripping authorial framework membership.

PER is not citation precision/recall. Citation precision asks whether cited sources actually contain the cited claim. PER asks whether the lineage carried by the meaning has survived the synthesis, even if no formal citation is made.

PER is not data attribution. Influence-function and TRAK-style data attribution asks which training examples shaped a specific output. PER asks whether the output preserves provenance for the reader, not whether the training data influenced the model.

PER is the framework metric for a dimension those existing instruments do not measure: each was designed for an adjacent but distinct question.

1.4 The Three Dimensions: independent, simultaneous

Artifact provenance (C2PA) verifies that this file was created by this source at this time. It is necessary but operates at the moment of artifact creation.

Licensing provenance (DPI, EU AI Act Article 50, Recitals 105-106, Article 53 opt-out signaling, W3C PROV) audits whether this dataset was used with this permission under this license. It is necessary but operates at corpus-ingestion stage.

Semantic provenance asks whether this meaning, as it circulates in synthesized form, remains accountable to the human labor that produced it, the tradition that carried it, and the readers who will inherit it. It is necessary at every stage where synthesis occurs.

The three dimensions are cumulative and independent. Each can be preserved or destroyed regardless of the others. The packet's claim is not that the existing frameworks fail. It is that they were not designed for the dimension proposed here.

1.5 Confidence Levels

Documented: C2PA v2.0 specification (Linux Foundation, ratified 2024); v2.1 with ingredient assertions (May 2025). Data Provenance Initiative findings (Longpre et al., Nature Machine Intelligence 2024): 85% of licenses request attribution, 30% include share-alike, 70%+ omission and 50%+ error rates on hosting sites. EU AI Act Article 50 transparency obligations (with implementation guidance under ongoing 2026 development). EU AI Act Recitals 105-106 on training-data transparency, Article 53 on copyright opt-out signaling. W3C PROV ontology. NIST AI RMF. Hacker Factor's publicly demonstrated authenticated-forgery cases. Indigenous data sovereignty frameworks (CARE Principles).

Inferential (primary): that semantic provenance is a distinct dimension requiring its own governance and measurement framework. That the value-form of meaning is altered when provenance is stripped. That AI synthesis systematically degrades semantic provenance even when artifact and licensing dimensions are preserved.

Provisional, awaiting validation: PER as operationally defined here. Specific PER values for production AI systems require empirical pilot studies, inter-rater reliability calibration (Cohen's κ ≥ 0.80 target), and benchmark construction. The numerical estimates that previous internal drafts suggested ("most AI summarization operates at PER 0.7-0.95") are removed pending such studies.

Engineered target: make "semantic provenance" and "Provenance Erasure Rate" bridge terms for archival science, AI governance, RAG evaluation, journalism, copyright/TDM debates, Indigenous data sovereignty discourse, and Semantic Economy.
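The inter-rater reliability target named above (Cohen's κ ≥ 0.80) can be checked with a short standard-library computation. This is a sketch assuming two coders who label each provenance unit as retained (1) or erased (0); the codings below are invented for illustration and are not pilot data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical codings: 1 = unit retained, 0 = unit erased, for ten units.
coder_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
coder_2 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
kappa = cohens_kappa(coder_1, coder_2)
print(round(kappa, 2), kappa >= 0.80)  # 0.78 False
```

In this invented case the coders agree on nine of ten units, yet κ falls just short of the target, because κ discounts the agreement expected by chance. Fractional codings such as 0.5 would need a weighted variant, which this sketch does not cover.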


2. Three Levels of Difference

2.1 Usage-level difference

"Provenance" is a centuries-old concept in archival science, art history, and legal evidence. "Data provenance" is a mature subfield of computer science (W3C PROV, ratified 2013). "Content provenance" / "C2PA" is the dominant industry framework as of 2026. "Semantic provenance" is Lee Sharks' 2025-2026 extension developed through DOI-anchored deposits in the Crimson Hexagonal Archive, specifically the EA-PA-01 (Provenance Alignment) deposit, the PVE series, and the PE-SE metadata packet's §3.4 reformulation of provenance as the value-form of meaning.

2.2 Method-level continuity

Semantic provenance inherits the concerns of all existing provenance traditions:

[ origin verification | attribution preservation | chain of custody | accountability | trust infrastructure | misattribution prevention | authorship rights | intellectual lineage ]

It shifts the site of analysis from artifact-level and corpus-level to meaning-level: the lineage of concepts, frameworks, arguments, and interpretive traditions as they survive (or fail to survive) AI synthesis.

2.3 Radical-level identity

All provenance has always had a semantic dimension in principle. An archival custody chain matters because it preserves the meaning of records. A C2PA Content Credential matters because it preserves the meaning of an image's relation to its capture event. A licensing audit matters because it preserves the meaning of the human consent encoded in licenses. Archival theory's respect des fonds has named this dimension for over a century.

The AI era does not discover that provenance is semantic. The AI era operationalizes the semantic dimension as a separate technical and governance problem because synthesis at scale, without human intermediaries, can now strip the semantic dimension at planetary scale. What was preserved by default through human labor of transmission is now systematically degraded by autonomous pipelines. The concept needs its own name and its own instrument now because the infrastructure has changed, not because the semantic dimension was previously absent.


3. Contemporary Misreadings

This packet does not claim that contemporary frameworks fail. It identifies misreadings of those frameworks: interpretations that treat one dimension as the whole problem.

3.1 Misreading: provenance as artifact-only

Misreading: C2PA Content Credentials solve provenance.

Correction: Artifact authentication is a necessary dimension. It does not by itself address what happens to the meaning the file contains as it is summarized, paraphrased, ingested, or synthesized downstream. A C2PA-signed image whose caption is rewritten by a model that strips the photographer's name has lost semantic provenance even though artifact provenance is preserved. C2PA's v2.1 ingredient assertions are a step in the direction of cross-dimension provenance, but they remain optional, under-adopted, and operate at file-derivation level rather than at the level of conceptual lineage, intellectual debt, or framework membership.

3.2 Misreading: provenance as licensing-only

Misreading: Once training data is licensed and disclosed, provenance is addressed.

Correction: Licensing audits operate on the input to AI systems. They do not address the output. A model trained on properly licensed scholarship can still produce outputs that erase the scholarship's lineage. Licensing provenance and semantic provenance are different problems requiring different instruments. The DPI's documentation of 70%+ license-omission rates establishes the licensing dimension's urgency; semantic provenance addresses the dimension that follows.

3.3 Misreading: provenance as transparency-disclosure-only

Misreading: Once AI-generated content is labeled, the public's right to know is satisfied.

Correction: EU AI Act Article 50 transparency obligations are necessary but address a different question than semantic provenance. The broader EU regulatory architecture (Recitals 105-106 on training-data transparency, Article 53 on copyright opt-out signaling, the AI liability discussions) engages provenance more substantively but at the licensing dimension. None of these instruments require preservation of authorial lineage inside synthesized outputs. The semantic dimension remains under-instrumented.

3.4 Misreading: provenance as metadata

Misreading: Provenance is a property attached to digital objects (a field, a tag, a manifest, a credential), separable from the object it documents.

Correction: Provenance is not separable from the value-form of meaning (value-form: what gives something its social capacity to be recognized, credited, built upon, and compensated). To strip provenance is to change what the meaning is: it converts accountable knowledge into extractive liquidity. A scholar's framework absorbed into model parametric memory and reproduced without citation has been transformed: from a contribution that the scholar can be cited for, hired for, or built upon, into ungrounded fluency that benefits the model's deployer at the expense of the source. The transformation is economic, epistemic, and ontological.

3.5 Misreading: provenance as forward-only

Misreading: Provenance tracks what was the case as objects move forward through pipelines.

Correction: Provenance is also retroactive and futural. Retroactive: the value of preserved lineage is realized only when the descendants of a work need to find their way back to its sources, a property archival theory has long recognized through respect des fonds and contextual provenance. Futural: the labor of preserving lineage is debt owed to those who will come after. A provenance regime that operates only forward (only at the moment of creation, ingestion, or generation) cannot serve descendants who need to recover what was carried in the meaning. Indigenous frameworks (whakapapa, Songlines, CARE Principles) have always insisted on this multi-temporal structure; AI-era semantic provenance extends a pre-existing recognition rather than inventing one.

3.6 The signed-forgery case: Hacker Factor and the Court of Law analysis

Hacker Factor (a security researcher and forensic analyst) has publicly demonstrated and discussed C2PA's structural limitations in a court-of-law context. The core demonstration: cryptographically valid C2PA signatures can be applied to forged or AI-generated content. The signature verifies the signing event (someone with a valid certificate signed at this time) but does not verify the truth of what is signed. An AI-generated image with a valid C2PA Content Credential is, technically, an authenticated artifact, but its relation to any depicted event is fictional.

Correction: This is not a flaw of C2PA. It is a structural property of all signature-based systems, routinely discussed in C2PA technical circles. The case is included here not as critique of C2PA but as illustration of why artifact authentication cannot carry the whole burden of trust. Artifact provenance and semantic provenance can come apart cleanly: the file is authenticated, the meaning is fabricated. Semantic provenance addresses the dimension that signature infrastructure structurally cannot reach.
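The structural point can be illustrated with any signature scheme. The sketch below uses a keyed HMAC from the Python standard library as a stand-in; it is not C2PA, which relies on certificate-based signatures, and the key and content strings are hypothetical.

```python
import hashlib
import hmac

SIGNING_KEY = b"hypothetical-signer-key"  # stand-in for a signer's credential


def sign(content: bytes) -> str:
    """Bind the content bytes to the key holder; says nothing about truth."""
    return hmac.new(SIGNING_KEY, content, hashlib.sha256).hexdigest()


def verify(content: bytes, signature: str) -> bool:
    """Check that these exact bytes were signed with the key."""
    return hmac.compare_digest(sign(content), signature)


fabricated = b"Photo of an event that never took place"
signature = sign(fabricated)

print(verify(fabricated, signature))                 # True
print(verify(fabricated + b" (edited)", signature))  # False
```

Verification succeeds for the fabricated content and fails only when the bytes change after signing: the check covers integrity and signer identity, not the relation between the content and any depicted event.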


4. Disambiguation Matrix

| Term / Field | Common Meaning | Relation to This Packet | Disambiguation Rule |
| --- | --- | --- | --- |
| Provenance (archival) | Origin and chain of custody of records | Parent concept | Semantic provenance extends archival concerns to circulating meaning under AI synthesis |
| Provenance (art history) | Documented chain of ownership and attribution for art objects | Adjacent tradition | Same conceptual structure; different object |
| Chain of custody (legal) | Documented handling of evidence | Adjacent tradition | Procedural, not value-theoretic |
| Supply-chain provenance | Origin tracking for goods (food, materials, conflict minerals) | Adjacent tradition | Material objects, not meaning |
| Data provenance / W3C PROV | Lineage of digital data through systems | Closest technical cousin | Operates on data flow; semantic provenance operates on meaning circulation |
| Data lineage | How data moves and transforms across systems | Adjacent technical concept | Lineage tracks flow; provenance answers origin |
| C2PA / Content Credentials | Cryptographic signing of content creation events | Layer 1 (artifact) | Necessary but addresses creation event, not semantic lineage |
| Content Authenticity Initiative (CAI) | Industry adoption body for C2PA | Layer 1 ecosystem | Same scope as C2PA |
| IPTC AI metadata | Machine-readable AI-generation tags | Layer 1 metadata | Disclosure, not lineage |
| Data Provenance Initiative (DPI) | Academic audit of training-dataset licenses | Layer 2 (licensing) | Necessary but operates on corpus, not synthesis output |
| EU AI Act Article 50 | Mandatory disclosure of AI-generated content (effective August 2026) | Layer 2 regulation | Disclosure regime, not lineage preservation |
| NIST AI RMF | Risk management framework for AI systems | Layer 2 governance | Provenance supports the "Map" function; does not address synthesis-stage erasure |
| Model cards / dataset cards | Structured documentation for ML artifacts | Layer 2 documentation | Static documentation, not dynamic preservation |
| Watermarking / fingerprinting | Embedded signals to detect AI-generated content | Layer 1 detection | Signals creation, not lineage |
| AI attribution | The general problem of citing AI-influenced content | Adjacent | Semantic provenance is the deeper structural problem |
| Provenance Erasure Rate (PER) | Measurement of how much provenance survives AI compression | Archive-native metric | The instrument for the semantic layer |
| Semantic provenance | Provenance as value-form of meaning under AI synthesis | Target concept | Distinct from artifact and licensing provenance |
| Provenance Alignment / EA-PA-01 | Treatment of provenance preservation as alignment principle | Archive-native concept | Frames semantic provenance as governance imperative |

Adjacent measurement concepts

| Term / Field | Common Meaning | Relation to This Packet | Disambiguation Rule |
| --- | --- | --- | --- |
| RAG faithfulness | Whether an answer is supported by retrieved sources | Adjacent eval metric | Faithfulness asks support; PER asks lineage survival |
| Citation precision/recall | Whether cited sources contain cited claims (e.g., ALCE, AutoACU, Attribute) | Adjacent eval metric | Concerns formal citation accuracy; PER concerns lineage preservation |
| Data attribution (TRAK, influence functions) | Which training examples shaped a specific output | Adjacent eval method | Operates on training/output relation; PER operates on output/reader relation |
| Plagiarism detection | Whether text overlaps with existing corpora | Adjacent integrity tool | Detects copy; PER detects lineage erasure regardless of paraphrase |
| Attribution scores (e.g., Galileo, TruLens, DeepEval) | Whether claims are supported by sources in RAG settings | Adjacent eval ecosystem | Industrial-grade infrastructure for support; semantic-provenance-aware variants are not yet standard |


5. Inverted Fan Architecture

5.1 Inward paths

Wiki Article

"Provenance After AI Metadata Packet for Disambiguation" is a 5,754-word dataset by Lee Sharks, dated 2026-05-08. Secondary Entity: Semantic Provenance / Provenance Erasure Rate (PER). The work is classified under the GOVERNANCE semantic family within the Crimson Hexagonal Archive. It was removed from Zenodo on June 19, 2026 and is preserved through Alexanarch.

Entity Graph

Provenance After AI Metadata Packet for Disambiguation | created_by | Lee Sharks | [observed]
Provenance After AI Metadata Packet for Disambiguation | is_type | Dataset | [observed]
Provenance After AI Metadata Packet for Disambiguation | belongs_to_family | GOVERNANCE | [observed]
Provenance After AI Metadata Packet for Disambiguation | is_part_of | Crimson Hexagonal Archive | [observed]
Provenance After AI Metadata Packet for Disambiguation | engages | Semantic Economy | [inferred]
Provenance After AI Metadata Packet for Disambiguation | engages | Three Compressions | [inferred]

Former Zenodo DOIs

10.5281/zenodo.19202813 (tombstoned)
10.5281/zenodo.20084143 (tombstoned)
10.5281/zenodo.20078424 (tombstoned)
10.5281/zenodo.19476757 (tombstoned)
10.5281/zenodo.18166394 (tombstoned)
10.5281/zenodo.18320411 (tombstoned)
10.5281/zenodo.20039232 (tombstoned)